A hybrid recursive multilevel incomplete factorization preconditioner for solving general linear systems

In this paper we introduce an algebraic recursive multilevel incomplete factorization preconditioner, based on a distributed Schur complement formulation, for solving general linear systems. The novelty of the proposed method is to combine factorization techniques of both implicit and explicit type, recursive combinatorial algorithms, multilevel mechanisms and overlapping strategies to maximize sparsity in the inverse factors and consequently reduce the factorization costs. Numerical experiments demonstrate the good potential of the proposed solver to precondition effectively general linear systems, also against other state-of-the-art iterative solvers of both implicit and explicit form.


Introduction
Krylov subspace methods may be considered the method of choice for solving large and sparse systems of linear equations arising from the discretization of (systems of) partial differential equations on modern parallel computers. This class of algorithms are iterative in nature. At every step k, they compute the approximate solution x k of a linear system Ax = b from the Krylov subspace of dimension k K k (A, b) = span{b, Ab, A 2 b, . . . , A k−1 b}, according to different criteria for each given method. The computation requires matrix-vector products with the coefficient matrix A plus vector operations, thus potentially reducing the cumbersome costs of sparse direct solvers on large problems, especially in terms of memory. All of the iterative Krylov methods converge rapidly if A is somehow close to the identity. Therefore, it is common replacing the original system Ax = b by or for a nonsingular matrix M ≈ A. Systems (1) and (2) are referred to as left and right preconditioned systems, respectively, and M as the preconditioner matrix.
In the case M is factorized as the product of two sparse matrices, M = M 1 M 2 , like in the Hermitian and positive definite case, one might solve the modified linear system If one may choose M so that M −1 A, AM −1 or M −1 1 AM −T 2 approximate the identity, and linear systems with M or with M 1 and M 2 as coefficient matrices are easy to invert, it is more efficient to apply a Krylov subspace method to the modified linear system.
Optimal analytic preconditioners based on low order discretizations, nearby equations that are simple to solve, or similar ideas have been proposed in the literature for specific problems. However, the problem-specific approach is generally sensitive to the characteristics of the underlying operator and to the details of the geometry. In this study, we pursue an algebraic approach where the preconditioner M is computed only from the coefficient matrix A. Although not optimal for any specific problem, algebraic methods are universally applicable, they can be adapted to different operators and to changes in the geometry by tuning a few parameters, and are well suited for solving irregular problems defined on unstructured grids.
Roughly speaking, most of the existing techniques can be divided into either implicit or explicit form. A preconditioner of implicit form is defined by any nonsingular matrix M ≈ A, and requires to solve an extra linear system with M at each step of an iterative method. The most important example in this class is represented by the Incomplete LU decomposition (ILU), where M is implicitly defined as M =LŪ ,L andŪ being triangular matrices that approximate the exact L and U factors of A according to a prescribed dropping strategy adopted during the Gaussian elimination process. These methods are considered amongst the most reliable in a general setting. Well known theoretical results on the existence and the stability of the factorization can be proved for the class of M -matrices [35], and recent studies are involving more general matrices, both structured and unstructured. The quality of the factorization on difficult problems can be enhanced by using several techniques such as reordering, scaling, diagonal shifting, pivoting and condition estimators (see e.g. [16,44,36,7,9]). As a result of this active development, in the last years successful results are reported with ILU-type preconditioners in many areas that were of exclusive domain of direct solution methods like in circuits simulation, power system networks, chemical engineering plants modelling, graphs and other problems not governed by PDEs, or in areas where direct methods have been traditionally preferred, like in structural analysis, semiconductor device modelling and computational fluid dynamics applications (see e.g. [41,6,1,34,43]). One problem with ILU-techniques is the severe degradation of performance observed on vector, parallel and GPUs machines, mainly due to the sparse triangular solves [33]. In some cases, reordering techniques may help to introduce nontrivial parallelism. However, parallel orderings may sometimes degrade the convergence rate, while more fillin diminishes the overall parallelism of the solver [17].
Explicit preconditioning tries to mitigate such difficulties by approximating directly A −1 , as the product M of sparse matrices, so that the preconditioning operation reduces to forming one (or more) sparse matrix-vector product, and consequently the application of the preconditioner may be easier to parallelize and numerically stable. Some methods can also perform the construction phase in parallel [23,10,26,37,38]; additionally, on certain indefinite problems with large nonsymmetric parts, the explicit approach can provide better results than ILU techniques (see e.g. [14,8,24]). In practice, however, some questions need to be addressed. The computed matrix M could be singular, and the construction cost is typically much higher than for ILU-type methods, especially for sequential runs. The main issue is the selection of the non-zero pattern of M . The idea is to keep M reasonably sparse while trying to capture the 'large' entries of the inverse, which are expected to contribute the most to the quality of the preconditioner. On general problems it is difficult to determine the best structure for M in advance, and the computational and storage costs required to achieve the same rate of convergence of preconditioners given in implicit form may be prohibitive in practice.
In this study, we present an algebraic multilevel solver for preconditioning general nonsymmetric linear systems which attempts to combine characteristics of both approaches. Assuming that the matrix A admits the factorization A = LU , with L a unit lower and U an upper triangular matrix, our method approximates the inverse factors L −1 and U −1 . Sparsity in the approximate inverse factors is maximized by employing recursive combinatorial algorithms. Robustness is enhanced by combining the factorization with recently developed overlapping strategies and by using efficient local solvers.
The paper is organized as follows. In Section 2 we describe the proposed multilevel preconditioner.
In Section 3 we show how to combine our preconditioner with overlapping strategies, and in Section 4 we assess its overall performance by showing several numerical experiments on realistic matrix problems, also against other state-of-the-art solvers. Finally, in Section 5 we conclude the study with some remarks and perspectives for future work.

The AMES solver
be a n × n general linear system with nonsingular, possibly indefinite and nonsymmetric matrix A = {a ij } ∈ R n×n , and vectors x, b ∈ R n . We assume that A admits for a triangular decomposition A = LU and we precondition system (4) as with M L ≈ L −1 and M U ≈ U −1 , clearly preserving symmetry and/or positive definiteness of A. This approach of preconditioning linear systems has been extensively investigated in a series of papers by Kolotilina and Yeremin [29,30,32,31], who prescribed the nonzero pattern of the inverse factors M L and M U of A in advance equal to the pattern of the lower and upper triangular part of A + A T , respectively, and determined the entries of M L and M U explicitly by solving linear equations involving the principal submatrices of A (the 'FSAI' preconditioner). Chow suggested to use as pattern for the inverse factors the structure of the lower and upper triangular part of (A + A T ) p , where p is a positive integer [12,13,45]. The larger p, in general the higher the quality of the computed preconditioner, although the construction, storage and application costs tend to increase rapidly with p. Blocking and adaptive strategies have been recently studied to overcome these problems [26,18,25]. Benzi and Tůma proposed to compute the entries of matrices M L and M U by means of a (twosided) Gram-Schmidt orthogonalization process with respect to the bilinear form associated with A, and to determine the best structure for the inverse factors dynamically, during the construction (the 'AINV' preconditioner). Sparsity is preserved in the process by discarding elements having magnitude smaller than a given positive threshold [3,4]. In this study we analyse multilevel mechanisms, recursive combinatorial algorithms and overlapping techniques, combined with efficient local solvers, to enhance robustness and reduce costs for the approximation of the inverse factors. We refer to the resulting preconditioner as AMES (Algebraic Multilevel Explicit Solver). It is easier to describe the AMES method by using graph notation, dividing the solution of system (4) in five distinct phases: 1. a scale phase, where the coefficient matrix A is scaled by rows and columns so that the largest entry of the scaled matrix has magnitude smaller than one; 2. a preorder phase, where the structure of A is used to compute a suitable ordering that maximizes sparsity in the approximate inverse factors; 3. an analysis phase, where the sparsity preserving ordering is analyzed and an efficient data structure is generated for the factorization; 4. a factorization phase, where the nonzero entries of the preconditioner are computed; 5. a solve phase, where all the data structures are accessed for solving the linear system.
Below we describe each phase separately.

Scale phase.
We initially scale system (4) by rows and columns as where the n × n diagonal scaling matrices D 1 and D 2 have the form For simplicity, we still refer in this paper to the scaled system (5) as Ax = b.

Preorder phase.
We use standard notation of graph theory to describe this computational step. We denote by Ω(Ã) the undirected graph associated with the matrix First, Ω(Ã) is partitioned into p non-overlapping subgraphs Ω i of roughly equal size by using the multilevel graph partitioning algorithms available in the Metis package [28]. For each partition Ω i we distinguish two disjoint sets of nodes (or vertices): interior nodes that are connected only to nodes in the same partition, and interface nodes that straddle between two different partitions; the set of interior nodes of Ω i form a so called separable or independent cluster. Upon renumbering the vertices of Ω one cluster after another, followed by the interface nodes as last, and permuting A according to this new ordering, a block bordered linear system is obtained, with coefficient matrix of the form In (6), each diagonal block B i corresponds to the interior nodes of Ω i , and the blocks E i and F i correspond to the interface nodes of Ω i ; the block C is associated to the mutual interactions between the interface nodes. In our multilevel scheme we apply the same block downward arrow structure to the diagonal blocks B i ofÃ; the procedure is repeated recursively until a maximum number of levels is reached, or until the blocks at the last level are sufficiently small to be easily factorized. As an example, in Figure 1(b) we show the structure of the sparse matrix rdb2048 from Tim Davis matrix collection [15] after three reordering levels.
To reduce factorization costs, a similar permutation is applied to the Schur complement matrix S = C − EB −1 F as follows

Analysis phase.
In the analysis phase, a suitable data structure for storing the linear system is defined, allocated and initialized. We use a tree structure to store the block bordered form (6) ofÃ. The root is the whole graph Ω, and the leaves at each level are the independent clusters of each subgraph. Each node of the tree corresponds to one partition Ω i of Ω(Ã), or equivalently to one block B i of matrixÃ. The information stored at each node are the entries of the offdiagonal blocks E and F of B i s father, and those of the block C of B i after its permutation, except at the last level of the tree where we store the entire block B i . All these matrices are represented in the computer memory using a compressed sparse row storage format, except for blocks F i that are stored in compressed sparse column format. Blocks E i and F i can be very sparse; many of their rows and columns can be zero, and this leads to a significant saving of computation.

Factorization phase.
The approximate inverse factorsL −1 andŨ −1 ofÃ write in the following form The original structure of the rdb2048 matrix.
(b) The structure of rdb2048 after permutation.
(c) The structure of the inverse factor. In red are displayed the entries actually stored. and L S , U S are the triangular factors of the Schur complement matrix Some fill-in may occur inL −1 andŨ −1 during the factorization, but only within the nonzero blocks. This two-level reordering scheme was used in the context of factorised approximate inverse methods for the parallelization of the AINV preconditioner in [2]. Differently from [2], we apply the arrow structure (6) recursively to the diagonal blocks and to the first level Schur complement as well, to gain additional sparsity. The multilevel factorization algorithm requires to invert only the last level blocks and the small Schur complements at each reordering level; the blocks W i , G i do not need to be assembled explicitly, as they may be applied using Eqn (9). For the rdb2048 problem, in Figure 1(c) we display in red the actual extra storage required by the exact multilevel inverse factorization in addition to matrix A; these represent only 34% of the total nonzeros of A. From the knowledge of the red entries, the blue ones can be retrieved from Eqn (9), using the off-diagonal blocks of A. We also permute the large Schur complement at the first level into a block bordered structure, until we reach a maximal number of levels or a given minimal size. The last-level matrix is inverted inexactly. An inexact solver is also used to factorize the last-level blocks B i in (10).

Solve phase.
In the solve phase, the multilevel factorization is applied at every iteration step of a Krylov method for solving the linear system. Notice that the inverse factorization ofÃ may be written as (11), we obtain the following expression for the exact inverse We can derive preconditioners from Eqn. (12) by computing approximate solvers B −1 for B andS −1 for S. Hence the preconditioner matrix M will have the form and the preconditioning operation Notice that Algorithm 1 is called recursively at lines 1-3, asB andS also have a block bordered structure upon permutation.

Algorithm 1
The preconditioning operation in the AMES solver.

Combining the AMES solver with overlapping
In [20], Grigori, Nataf and Qu have introduced an overlapping technique to enhance the robustness of multilevel incomplete LU factorization preconditioning computed from matrices reordered in arrow form, e.g. using the nested dissection method by George [19]. The multilevel mechanism incorporated in the AMES preconditioner described in the previous section is based on a nested dissection-like ordering, and thus it can easily accomodate for overlapping. We have tested this idea in our numerical experiments, and in this section we shortly describe the procedure adopted. The results of our experiments are reported in Section 4.

Background
Let Ω = (V (Ω), E(Ω)) be the graph of A, V (Ω) denoting the set of vertices and E(Ω) the set of edges in Ω. If the graph is directed, we denote an edge of E issuing from vertex u to vertex v as (u, v); u is called a predecessor of v, and v a successor of u. If the graph is undirected, we denote the edges of E by non-ordered pairs {u, v}; u is called a neighbour of v. As in the previous section, we assume that Ω is partitioned into p independent non-overlapping subgraphs Ω 1 , . . ., Ω p , and we call S the set of separator nodes, straddling between two different partitions. Goal of overlapping is to extend each independent set of Ω by including its direct neighbours, similarly to the overlapping idea used in other domain decomposition methods, for example in the restricted additive Schwarz method [39,40]. Following [20], we denote by V (Ω i,ext ) the separator nodes that are successors of Ω i , and by V (Ω ext ) the complete set of successor nodes of all the subdomains Then Ω i is extended to the setΩ i as and the separator S is extended toŜ by adding the successors of nodes in V (Ω ext ), that is Using this notation, the overlapped graph of A,Ω = (V (Ω), E(Ω)), is introduced as follows. First define the overlapped subgraphΩ i and the overlapped separatorS as, respectively, For simplicity we refer to (x, i) as x i . Then the vertex set V (Ω) of the overlapped graphΩ is formed by the disjoint union of the V (Ω i )'s and of V (Ŝ) as Recall that, given the union B of a family of sets indexed by the index set I, their disjoint union C is defined as the set At this stage, it is useful to introduce the two projection operators Π 1 and Π 2 such that With this notation, the set of edges of the overlapped graphΩ is defined according to their projection onto the original graph as follows The following property, established in [20], ensures an equivalence between the equations of the overlapped system and those of the original system.

Property 1
Let Ω be the associated directed graph of a given system of linear equations and u be a vertex of V (Ω). LetΩ be the overlapped graph, and let u i be a vertex of This property shows that there exists a bijection between the nonzeros of the equation corresponding to vertex u in the original system and the nonzeros of the equation corresponding to its dual u i , where Π 1 (u i ) = u. From a matrix viewpoint, to each nonzero entryã ui,vi in the overlapped matrix there is a unique nonzero entry a u,v in the original matrix, where Π 1 (u i ) = u and Π 1 (v i ) = v. Therefore there is a one-to-one correspondence between equations in the original system and those in the overlapped system. By solving the overlapped system, we can automatically obtain the solution of the original system.

Example
We display a simple example from [20] to describe shortly how the overlapping procedure works in practice. We consider a 5 × 5 matrix having the structure shown in Figure 2(a). The graph consists of two independent subgraphs Ω 1 = {1, 2}, Ω 2 = {3} and one separator S = {4, 5}. We just pick the first subgraph and the separator set to explain. Separator nodes that are successors of Ω 1 are the set Next, we compute the overlapped separator setS. The vertices of V (Ω 1 ) and Similarly for the other two diagonal blocks, and this is shown in Figure 2(b).
From Eqn. (20), we can construct the edges fromΩ 1 toS. These are the nonzero entries of the overlapped interface blockF 1 , adopting the same notation as in (6). We pick the V (Ω 1 ) = {1, 2, 4, 5} rows and V (Ŝ) = {4, 5, 1, 3} columns of the original matrix, and we set the columns corresponding to the common The matrix after one-level overlapping Similar procedure is followed for the other blocksF i ,Ẽ i . Finally, the overlapped matrix has the form given in Figure 2(b). The block arrow structure of the original matrix is preserved. However, symmetry is lost and the sparsity pattern also changes significantly.

Analysis
It is interesting to analyse the effect that overlapping may produce on the AMES method. According to (15) and (18), the size and the number of nonzeros in each subgraph is increased after overlapping. According to (20), the interconnections between subdomains and separator are reduced in the overlapped system, as the original interconnectivities are all removed.
The more nodes are added to the subgraphs, the richer they are in terms of information about the system matrix, and a larger performance improvement may be expected. In the overlapped system, each blockB i has the following structure From Eqn. (13) we see that the set of neighboring nodes V (Ω i,ext ) corresponds to the nonzero columns of the block F i , and the nonzero elements of F i are determined by the set of interconnections E(Ω i , Ω iext ). Therefore, the more dense and larger the blocks F i , i = 1 : p, (that is, the size of separator) in the original matrix, the more nodes and interconnections are added to subdomains, and a larger reduction of the number of iterations can be achieved.

Algorithmics
The AMES preconditioning algorithm described in Section 2 with one extra overlapping phase writes as follows: 1. a scale phase, where the matrix A is scaled by rows and columns so that the largest entry of the scaled matrix has magnitude smaller than one; 2. a preorder phase, where the structure of A is used to compute a suitable ordering that can maximize sparsity in the approximate inverse factors; 3. an overlap phase, which extends each block B i and the Schur complement, and generates the overlapped matrixÃ and the right-hand side vectorb; 4. an analysis phase, where the sparsity preserving and overlapping orderings are analyzed and an efficient data structure is generated for the factorization; 5. a factorization phase, where the entries ofÃ are processed to explicitly compute the approximate inverse factors; 6. a solve phase, that accesses all the data structures for solving the overlapped linear system.
7. a restriction phase, that restricts the solution obtained from the overlapped system to the original system, and obtains the solution.

Numerical experiments
In this section we present the results of our numerical experiments to illustrate the performance of the AMES preconditioner, also against other state-of-the-art methods and software for solving general linear systems. The selected matrix problems are extracted from the public-domain matrix repository available at the University of Florida [15], and arise from various application fields. We present a summary of the characteristics of each linear system in Table 1 (GMRES) method by Saad and Schultz [42]. In all our runs we started the iterative solution from the zero vector and we stopped it when either the initial residual was reduced by twelve orders of magnitude or when no convergence was achieved after 5000 matrix-vector products. To limit memory costs, we restarted the GMRES algorithm every 500 iterations. The right-hand side b of the linear system was chosen so that the solution is the vector of all ones, that is b = Ae with e = [1, ..., 1] T . In each run we recorded the following performance measures: 1. the density ratio nnz(M L +M U )

nnz(A)
, that is the ratio between the number of nonzeros in the preconditioner matrix M = M U M L versus the number of nonzeros in the coefficient matrix A; 2. the number of iterations Its required to reduce the initial residual by 12 orders of magnitude starting from the zero vector; 3. the CPU time cost in seconds for completing the preorder phase (denoted by t p ), for constructing the approximate inverse factorization (t f ), and for solving the linear system (t s ). Symbol "-" means that the corresponding phase does not apply to the given run. For example, some of the preconditioners used for the comparison against our method do not have a preorder phase.
The codes were developed in Fortran 95 and the experiments were run in double precision floating point arithmetic on a PC equipped with an Intel(R) Core(TM)2 Duo CPU E8400, 3.00 GHz of frequency, 4 GB of RAM and 6144 KB of cache memory. In the coming sections we study the effect of using different parameter settings, and we illustrate the overall performance on the selected matrix problems.

Performance of the multilevel mechanism
The AMES method can be seen as a multilevel generalization of factorized approximate inverse techniques such as the FSAI preconditioner by Kolotilina and Yeremin, and the AINV preconditioner by Benzi and Tůma, that we recalled in Section 2. Therefore, first we present some comparison between these methods, to show the benefit of the multilevel mechanism. The results are reported in Table 2. For these runs, we considered four matrix problems from Table 1, that are orsirr 1, 1138 bus, bcsstk27 and epb0. In our AMES solver, we inverted the last level blocks using ILU, FSAI and AINV factorizations. For ILU, we used the multilevel implementation available in the ILUPACK package [5] (this combination is referred to as AMES ILU in the table). For FSAI, we used the structure of the nonzero pattern of the lower (resp. upper) triangular part of the symmetrized block for the approximate inverse factors, and also the square of this pattern (this combination is referred to as AMES FSAI ). Finally, for AINV we used the implementation kindly provided by the authors (this combination is referred to as AMES AINV ). The dropping threshold value selected for the AINV, AMES AINV and AMES ILU methods (referred to as Drop in the Table) is an absolute value, and was computed so that the resulting preconditioners had roughly equal memory cost. We used the default value for the parameter condest = 10 (norm bound for the inverse factors L −1 and U −1 ) in ILUPACK.
In our runs, the multilevel variants AMES FSAI and AMES AINV performed consistently better than the FSAI and AINV solvers in terms of convergence rate and storage cost. This is probably due to the multilevel mechanism that enabled us to exploit sparsity in the inverse factors more effectively. The best solutions with AMES were obtained using ILU as local solver, while the threshold-based dropping rules of the AINV method often computed a better pattern for the approximate inverse factors than the static approach used in the FSAI method. We can see evidence of this behaviour in Figures 3 -6, where for one of the last-level blocks of the permuted coefficient matrix (6) we compare the structure of its exact inverse factor L −1 , and of the approximations M L and W T of L −1 as computed by, respectively, the AMES FSAI code using the square of the pattern of the symmetrized block, and by the AMES AINV code. Large to small entries are depicted in different colors, from red to orange and yellow. The approximation is good for the 1138 bus problem (Figure 3) but poor for the orsirr 1 matrix (Figure 4), and this is confirmed by the different convergence results for the two problems, reported in Table 2. On some larger problems, like the cz40948 and the ABACUS shell ud problems, shown in Figures 5 -6, we found that L −1 had no evident structure; in this case we had to increase the number of nonzeros in M L and W T significantly to converge. For example on the ABACUS shell ud problem, AMES AINV converged in 468 iteration with nnz(Z+W ) nnz(A) = 11.6 while AMES FSAI did not converge at this value of density. In these situations, uniformly better convergence were obtained using ILU as local solver. We will focus mostly on this choice of local solver for the experiments of this paper. Notice that in this case the entries of the inverse factors are not computed explicitly, and the application of the preconditioner is carried out through a backward and forward substituion procedure. Other options may be considered for the last level solver, such as the ARMS method [44] and enhanced FSAI methods [27], but these are not included in the presented analysis.

Method
Pattern Drop/condest Its

Varying the number of independent clusters at the first level
We considered three matrix problems in our runs: cz20468, ABACUS shell ud and cz40948. In Table 3 we show the results varying the number of independent clusters p at the first level of reordering of A in (6). For each problem, we used the same number of levels n lev in the AMES structure, and tuned the drop tolerance in the local ILU factorization to keep the memory ratio nnz(M L +M U )

nnz(A)
roughly constant while increasing p in different runs. Clearly, larger p results in more independent clusters of smaller size, and in larger Schur complement matrices. In the table we report the ratio sizeB sizeA S between the average size of the independent clusters B i and the size of the Schur complement at the first level. Increasing p reduces in turn sizeB sizeA S to values smaller than 1. Using ILU as local solver, the best convergence results were obtained when sizeB sizeA S ≈ 1. Our experiments indicate that for good performance the size of each independent cluster should be approximately equal to that of the Schur complement.

Varying the number of reduction levels for the diagonal blocks
We consider again matrices cz40948, ABACUS shell ud and cz20468 for our tests. We varied the number of levels n lev from 1 to 3 in the multilevel reordering of the diagonal blocks. In each run we tuned the dropping threshold parameter to have roughly the same memory cost in the experiments for each matrix. We chose the value of p for each problem so that sizeB sizeA S ≈ 1 as reported in Section 5.2. The last level blocks were factorized using ILUPACK [5]. The results reported in Table 4 show that using more levels can reduce the number of iterations for similar memory ratio as we can gain additional sparsity during the factorization. However, probably due to our non optimized implementation, the solution cost tends to increase. From our experiments, a small number of reduction levels is recommended to use.

Varying the number of reduction levels for the Schur complement
The Schur complement matrix relative to the block C in (6) typically preserves a good deal of sparsity, and this can be further exploited during the factorization by applying, e.g., the multilevel nested dissection reordering to A S , similarly to what is done to the upper leftmost block B. We implemented this idea at the first permutation level, using ILU factorization as local solver and selecting the same values of p and n lev for each matrix problem. We tuned the drop tolerence in the ILU factorization to have roughly the same memory costs in different runs. We varied n levAS from 0 to 3 (n levAS = 0 means that only the diagonal blocks of the upper-left block B are permuted). The results reported in Table 5 show that the simultaneous permutation of both the diagonal blocks of B and of the Schur complement can make the preconditioner more robust. We adopted this strategy in the experiments illustrated in the coming sections, selecting in each run the value of n levAS that minimized the total solution cost.

Comparison against other solvers
We compared the performance of the AMES preconditioner against other three popular algebraic preconditioners for solving linear systems, that are the ILUPACK solver developed by Bollhöfer and Saad [5], the Algebraic Recursive Multilevel Method (ARMS) proposed by Saad and Suchomel [44], and the SParse Approximate Inverse preconditioner (SPAI) introduced by Grote and Huckle [21]. As in the previous experiments, for each run we recorded the CPU time from the start of the solution until the initial residual was reduced by 12 orders of magnitude or until the process failed. We declared a solver failure when no convergence was achieved after 5000 iterations of the restarted GMRES method. We selected the parameters carefully to have a fair comparison between different methods. In AMES, following our conclusions from Section 4.2, we selected the number of blocks B i at the first level so that their average size is almost equal to the size of the Schur complement. For every problem we tested different combinations of number of levels n lev of recursive factorization and different values for the dropping threshold parameter droptol for the Matrix p Its    default droptol = 0.001, costing nnz(M ) nnz(A) = 33.9 and t f = 1111s; and SPAI could not converge in 5000 iterations with nnz(M ) nnz(A) = 0.19, using the default value droptol = 0.6. The number of levels of recursive factorization in the multilevel methods ILUPACK and ARMS are calculated automatically by the original codes developed by their authors. We point out that the performance comparison between AMES and the other solvers at fixed memory occupation may be a little penalizing for the AINV, FSAI and SPAI preconditioners as onelevel approximate inverses inherently need more memory; the ARMS method is a multilevel solver, but it factorizes the diagonal blocks without any permutation.
In Table 6, we show the complete results of our experiments. These include number of iterations (Its), density ratio ( nnz(M L +M U )

nnz(A)
), time costs for the preordering (t p ), factorization (t f ) and solve phase (t s ). We also tested the unpreconditioned GMRES for these matrices problems, and no convergence is achieved. We clearly see the good potential of the multilevel mechanism incorporated in the AMES preconditioner to reduce the number of iterations of Krylov methods, also in comparison to other multilevel solvers at low to moderate memory costs. In our examples, AMES was more robust than these solvers especially at low memory ratios.

Effect of overlapping
We solved several problems from Table 1 combining the AMES method with overlapping after the first level of reordering in (6). In these runs, we set n lev =  2, and we tuned the droptol parameter to have roughly the same memory costs in the experiments with and witout overlapping. In the last two columns of Table 7 we give the effect of overlapping on the change in size and in number of nonzeros for the overlapped system. The number of iteration (Its) are almost the same after overlapping for problems cz20468 and cz40948, while for problems sme3Db, ABACUS shell ud and raefsky3 we observed a consistent reduction of the number of iterations Its by a factor between 9.5% and 23.8% and of the solving time t s by a factor between 21.4% and 29.9%. This is in agreement with our analysis of Section 3. In Table 8, for each problem we studied the sparsity pattern of block F and the size of blocks B and C before and after overlapping is applied at the first reordering level. The quantity Sp F denotes the ratio between the number of nonzero elements and the size of F , that is the sparsity degree nnz(F ) size(F ) . As we can see, the cz20468 and cz40948 problems have the smallest relative size of the separator C and also the smallest value of Sp F ; this means that less information is added to the subdomains. Following   Table 6: Performance comparison of the multilevel approximate inverse preconditioner against other iterative solvers, both one-level and multilevel. the analysis reported in Section 3, the overlapping technique is less likely to help on these two matrices, and this is also confirmed by the numerical results. Differently, problems sme3Db and raefsky3 show larger values of size C and Sp F and in fact overlapping has a better effect on convergence for these two problems.
In our experiments we found that a small number of independent clusters p is recommended to use when overlapping.

Utilizing direct solvers in the AMES framework
The results of previous sections indicate that the multilevel mechanism can be effective to reduce the memory burden but, at least in our implementation, tends to increase the cost per iteration. As an attempt of a possible remedy, we performed some runs setting the dropping threshold parameter droptol equal to zero, and using a sparse direct solver, namely the routine MA38 from the HSL Mathematical Software Library [22], as a local solver. No approximation is introduced and the Schur complements are exact. Therefore in each problem we can obtain convergence in one or two iterations, and the solving phase is much cheaper. This can be observed in Table 9 on selected matrix problems. Comparing against the results with inexact inversion, we see that using a direct solver as local component can save computational time at only moderate extra storage cost.   3.01 1 5.2 11.6 0.8 Table 9: Using the AMES factorization as a direct solver.

Conclusions
In this paper a recursive multilevel implementation of factorized sparse approximate inverse preconditioners for Krylov subspace methods was discussed. We used recursive combinatorial techniques and overlapping strategies as an attempt to remedy two typical drawbacks of explicit preconditioning, that are lack of robustness and high construction cost. The numerical experiments show that these strategies can improve the performance of conventional approximate inverse methods, yielding iterative solutions that can compete favourably against other popular solvers in use today. Parallelism can be exploited at various levels in our method, alongside other code optimization. Fine-grained blocking, filtering, postfiltering, adaptive pattern selection strategies have been shown to be promising approaches in other contexts [26,18,25,13,12,11], and these can be considered also in our setting. In a distributed memory implementation, it will be natural to split the oct-tree by assigning the local problems to different processors. An efficient use of recursive combinatorial algorithms may reduce considerably the size of the Schur complements, hence the amount of inter-node communications. Memory demands, an important bottleneck of modern algorithms, are also limited, but this does not penalize much the overall numerical efficiency of the solver, as illustrated by the experiments of Tables 6 and 9. Overlapping does not destroy the sparsity structure of the matrix and can reduce further the interconnections between subdomains and separator set. Hence it is worthwhile considering it in a parallel setting as well. However, the parallel implementation of a fully distributed Schur complement formulation may not be trivial and will be considered in a separate study.

Acknowledgements
The work of the first and of the third authors is funded by the Ubbo Emmius scholarship of the University of Groningen. The work of the last author is supported by NSFC (61170309), the Fundamental Research Funds for the Central Universities (ZYGX2013Z005). We are grateful to Miroslav Tůma for motivating discussion and for supplying the AINV package for our numerical experiments. Finally, we would like to thank the anonymous reviewers for their valuable and thorough comments that we believe helped improve significantly the presentation and the quality of the paper.