Once the flow is calculated, it is necessary to identify the nodes to be migrated. In doing so, it is essential that the edge-cut as well as the amount of data migration are minimized in addition to restoring the balance, and this process has to be in parallel.
In [83] this was achieved by using formula (8) to decide the amount of load to be moved, taking into account the minimization of edge-cut and load balancing. Nodes are selected using the idea of relative gain (Section 3.4.3). Results in [83] on an adaptive finite element problem with a mesh of 31172 nodes refined in 8 steps to 224843 nodes showed that this approach was successful. On the Cray T3D, the time for each dynamic load balancing associated with each refinement was less than a second for 64 processors. Furthermore, the edge-cut was competitive with those produced by repartitioning the meshes. The most telling fact of all was that the number of nodes migrated was as little as a few percent of the total number of nodes, compared with near 100% for repartitioning.
In [70,71], the parallel METIS algorithm [50] was extended to add the capability of dynamic load balancing. One of the differences compared to JOSTLE-MD is that dynamic load balancing was not applied on the original graph. The proposed algorithm took a two stage approach - the multilevel diffusion stage, followed by the multilevel refinement stage. When applied to a partitioned mesh with unbalanced load, the algorithm first generated a multilevel of meshes, one coarser than the other. Dynamic load balancing was then carried out on the coarse meshes, using either ``directed diffusion'' or ``undirected diffusion''. The former used the algorithm of potentials of Section 4.3.4 to calculate the flow, while the latter was close to the classical diffusion algorithm of Section 4.3.1. Boundary vertices were visited in a random order and moved to the neighbor to achieve load balance. After the load was balanced, multilevel refinement started and the graph was refined. Boundary vertices were again visited randomly and checked to see if migration of them to a neighboring processor would result in reduced deviation from the original partitioning; reduced edge-cut or improved load balance. The algorithm was compared against JOSTLE-MD and was shown to be very competitive.
In [67] a layer process was used for node selection. Each boundary node is assigned a label equal to the partition that has the maximum number of nodes that are adjacent to this node. Nodes that are adjacent to the boundary nodes are then marked based on the labels of the boundary nodes. The process is recursively carried out until all of the nodes are marked.
In [13,25,26], element selection for the so called tiling algorithm and the iterative tree balancing algorithm is based on the idea of priority, similar to the concept of gain used in the K-L algorithm. Elements are assigned priorities (initially zero) based on the locality of their neighbors. An element's priority is decreased by one for each neighbor in its own processor, increased by two for each neighbor in an importing processor, and decreased by two for each neighbor in any other processor.
An important aspect of the node selection process is that nodes need to be selected to minimize the data migration in addition to minimizing the edge-cut. In [70], the size and weight of a vertex are defined as the cost of migrating this vertex and the work load of this vertex, respectively. A vertex is dirty if it is displaced (no longer on its original subdomain) in the process of balancing. Two heuristics for reducing the TotalV (the total sum of the size of dirty vertices after load balancing) and MaxV (maximum of the sums of the sizes of vertices which migrate into or out of any one partition after load balancing) were proposed. This is necessary partly because in ParMETIS, load balancing was carried out on coarse graphs, where each vertex represents a varying number of original vertices of the graph. Nonetheless the technique can be useful for applications where the mesh nodes have heterogeneous loads due to, say, the use of the local time stepping or local spatial approximation schemes.
The first heuristic was proposed to minimize MaxV. It was applied in the multilevel diffusion stage and involved a suppression factor. The density of the vertex is its weight divided by its size. What flow calculations give is the amount of load (the sum of the vertex weights) that need to be migrated to restore the load balance. Clearly to satisfy a given flow, it is cheaper to send vertices of higher density. Thus it was proposed [70] that only vertices with a density higher than the average vertex density of the graph multiplied by the suppression factor qualified for migration. The second heuristic was proposed to reduce TotalV. Here priority was given to the migration of dirty vertices by the use of a cleanness factor in the multilevel refinement stage. Only those clean vertices whose gain is greater than their size times the cleanness factor were considered for migration. The use of either heuristic may increase the edge-cut [70], nonetheless it was possible to improve both TotalV and MaxV with only a slight deterioration of edge-cut.