Adaptive and generic parallel computation of echelon forms


Séminaire Modèles et Algorithmes Déterministes: CASYS

26/11/2015 - 09:30 Mr Ziad Sultan Salle 1 - Tour IRMA

This work deals with parallelization of Gaussian elimination over a finite field Z/pZ. This computation is motivated by numerous applications, including algebraic attacks with polynomial system solving, index calculus attacks on the discrete logarithm, etc.

We propose efficient parallel algorithms and implementations of LU factorization over word size prime fields and target shared memory architectures. With respect to numerical linear algebra, we have identified three major differences, specific to exact linear algebra.

First, the arithmetic complexity could be dominated by modular reductions. Therefore, it is mandatory to delay as much as possible these reductions. Delaying the latter can be done by accumulating several multiplications before reducing while keeping the result exact. This induces splitting the matrices in blocks of large size to accumulate multiplication operations as much as possible, without overflow.

Second, over a finite field stability is not an issue contrarily to numerical linear algebra. Therefore, asymptotically fast matrix multiplication algorithms, like Winograd's variant of Strassen's fast algorithm, can more easily be used on top of the BLAS. Their speed-up increases with matrix dimension. Thus, on one hand we need to choose sufficiently large blocks well suited to those fast variants and to delay modular reductions. On the other hand, a finer grain allows more flexibility to the scheduler for load and communication balancing.

Third, many applications over finite fields require the rank profile of the matrix which is a key invariant used in many applications using exact Gaussian elimination, such as Gröbner basis computations and computational number theory. Moreover, the rank profile is only discovered during the algorithm and thus the rank deficiency of some blocks leads to unpredictable dimensions of the tiles or slabs of the matrix. This creates challenges on the efficiency of the underlying scheduler to handle and balance dynamic workloads of heterogeneous tasks.

We thus propose and compare different parallelization strategies and block decomposition to tackle these issues: tile iterative with left-looking, right-looking and Crout variants with delayed modular reductions; slab and tile recursive variants; dynamic/static block cutting and adaptive coupling of the different variants. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occur. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possible to maintain a high efficiency.