The static load balancing problem for a mesh based application involves partitioning the mesh into subdomains. The subdomains can then be distributed over the processors and calculation carried out in parallel. Different partitions of the mesh may result in different times to completion for the calculation. It is therefore necessary to examine the quality of the partitioning based on its effect on the application code. There are a number of factors.
The first factor is that of the load balance. The computational work of each processor should be balanced, so that no processor will be waiting for others to complete. Assuming that the computational work per processor is proportional to the number of mesh nodes in the subdomain, then to achieve load balance it is necessary for the number of nodes in each subdomain to be the same.
The second factor is that of the communication cost. When forming the discretized equations on a node of the mesh, the contributions from its nearest neighbor nodes will usually be needed. Depending on the order of the discretization scheme, contributions from more distant neighboring nodes may be necessary. On a parallel computer, accumulating the contributions from nodes that are not on the current processor will incur communication cost. It is known that on distributed memory parallel computers the cost of accessing remote memory is far higher than that of accessing local memory (typically a ratio of between 10 to 1000). It is therefore important to minimize the communication cost.
There are other factors that affect the time to completion of a parallel calculation. Often, to maximize the parallelism, the solver algorithm implemented on a parallel computer is not chosen to be mathematically equivalent to the sequential algorithm. Therefore the number of iterations needed for a parallel code to converge on multi-processors may not be the same as that on a single-processor. This is particularly true when implicit algorithms are used, and the way a mesh is partitioned can affect the rate of convergence considerably. However for now, we will in essence assume that a linear explicit solver is being used and seek to meet two criteria
It is useful to introduce the idea of a communication graph, and to distinguish the computational mesh from its communication graph. The communication graph characterizes the data dependency of the computation on the mesh. For instance, when constructing a communication graph for the purpose of partitioning a finite element mesh,
When there are only two subdomains or processors, the graph partitioning problem
becomes a graph bisection problem, where given a graph with
vertices and edges , it is required to find a partition
such that
,
and the edge-cut :
The graph bisection problem has been studied in the past by many authors (see, e.g., [53,63,11]) in the context of graph theory as well as VLSI circuit layout. Advances in parallel computing hardware and software has renewed interest in the problem. The graph bisection problem is an NP hard problem, so there are no known algorithms that can find the exact solution to the problem in polynomial time. Most of the graph bisection methods therefore seek a good approximation to the optimal partitioning that can be calculated efficiently.
In the following a number of partitioning algorithms are described. Many of them are bisection based. A partitioning for an arbitrary number of processors requires the recursive use of the bisection algorithm (for a number that is not a power of two, the bisection will give two subdomains of different sizes, see [42]). We shall denote the number of vertices of the graph, and the number of processors (subdomains).