Next: Spatial Decomposition Program Details Up: Main Previous: Sequential Link-Cell Algorithm

3. Spatial Decomposition Theory

QDPD was originally written in Fortran 77 as a serial program. A lot of the formalism of the sequential link-cell algorithm relies heavily on the Allen and Tildesley book [9] and computer routines (such as MAPS and TOPMAP) discussed in the book and available on the Web [13]. We have retained the names of the Allen and Tildesley routines in our program and in the discussion. Routines discussed later in the text, such as EXTVOL, LEBC, and MOVPAR, are parallel routines and have no counterpart in Allen and Tildesley. To improve computational performance, a parallelization was done relatively quickly using a simplified version of the replicated data approach and the standard message passing interface library (MPI [1]), as described in Sims et. al. [14]. We reported speedups of as much as 17.5 times on 24 processors of a 32 processor shared memory SGI Origin 20001. When doing a calculation on multiple processors, the total run time can be represented as the sum of computation (cpu) time and communication time, viz.,

In the replicated data approach, as P (number of processors) increases, tcpu goes down, but we still have to communicate the same amount of information (proportional to N (number of particles), so it doesn't scale). Also distributed memory machines often do not possess enough memory on a processing node to hold all of the data for a large job. When the goal is to simulate an extremely large system on a distributed-memory computer to allow for the larger total memory of the distributed-memory computer and also to take advantage of a larger number of processors, a different approach is needed. Since the link-cell algorithm we used in the sequential and replicated data approaches breaks the simulation space into domains, it seems natural to map this geometrical, or domain, decomposition onto separate processors. Doing so is the essence of the parallel link-cell technique [11, 15]2. By subdividing the physical volume among processors, most of the computation becomes local and the communication is minimized so there is, in principle, an N / P scaling (N = number of particles, P = number of processors), an efficient approach for distributed-memory computers and networks of workstations.The basic idea is this:
Split the total volume into P domains, where P is the number of processors. If we choose a 1D decomposition ("slices of bread"), then the pth processor is responsible for particles whose x-coordinates lie in the range

Similar equations apply for 2D and 3D decompositions for simulation dimensions Ly and Lz. Whether the decomposition is 1D, 2D, or 3D depends on the number of processors. An algorithm due to Plimpton [17] is used to assign P processors to a 3D box so as to minimize the surface area (and hence, yield a good load balancing). For P processors and a given simulation box of dimensions Lx, Ly, and Lz, the algorithm is the following. Loop through all factorizations of P into Px, Py, and Pz processors, computing the area of the resulting box, and pick the one with the minimum surface area. Of multiple equal surface areas (for example, Px, Py, Pz = (4,2,2), (2,4,2), (2,2,4)), pick the one with PxPyPz.

Each processor runs a link-cell program corresponding to a particular domain of the simulation box. For example, in Fig. 5 nine processors were used to divide the 2D simulation space into domains, each processor being assigned to one of the nine domains (here we assign each processor an index, where the indices start at 0). We also show the central processor's domain being subdivided into cells. To complete the force calculation on particles in cells at the interface between processors, each processor needs to know information about the particles in the adjacent cells, which now will be found on a neighboring processor. To handle this problem we construct an extra layer of cells on each processor at the interface between processors. At each timestep we communicate information across the interface between adjacent processors describing the particles in these edge cells (subroutine EXTVOL). The information that has to be passed by EXTVOL is the information needed for the forces calculations, which is,  and the unique particle number discussed below. For example, in Fig. 5 we show processors 3, 4, and 5 and we also show, with dashed lines, the cells in processors 3 and 5 which are adjacent to 4. Information about the particles in these dashed line cells is communicated to 4, making up "ghost" cells on 4. To complete the "extended volume" needed on processor 4 to compute the forces on all the particles it "owns", information is communicated (swapped) across the interface between adjacent processors in the Y direction as well. To account for the cross-hatched corner cells, the swap in Y includes information about not only particles that the processor owns but also information about "other" particles in "ghost" cells. So processor 7 sends information about particles in the dashed line cells as well as the cross-hatched cells (obtained from processors 6 and 8) to processor 4. At this point processor 4 has all the information it needs to calculate forces on all the particles it owns (processor 4 now has information about all the particles shown in the extended volume comprised of the domain of processor 4 plus the surrounding dashed line ghost cells), and similarly all of the other processors have all the information they need. These exchanges of data can be achieved by one set of communications between the processors. A processor only has to communicate once with all of its neighbors, so each processor communicates with at most four other processors (six in 3D), rather than, say 64 in a 64 processor replicated data calculation. Now on each processor, form a link-cell list of all particles in the original volume plus the extended volume. Loop over the particles in the original volume, calculating the forces on them and their pair particle (for conservation of momentum). Care must be taken to add these pair particle forces on particles in the extended volume to the forces on the pair particles in the processor "owning" them, which necessitates an extra set of communications between processors (the reverse of the communication swaps setting up the "ghost" cells). This extra communication step is necessary in the QDPD method since the interparticle force calculation involves the use of a random number for thermal effects and momentum conservation requires that the same random number be used in the equal and opposite force calculation. Finally calculate the new positions of all particles and move the particles which have left a processor to their new home processor. If particles move into domains controlled by other processors, information about the particle (the particle’s properties) must be moved to its new "home" processor. Again these exchanges of data can be achieved by one set of communications between the processors, and are implemented in subroutine MOVPAR. In this set of communications, all information about a particle needed for one time step must be communicated, not just the information needed for the forces calculation (the information communicated in EXTVOL as explained above).

Our spatial decomposition program has the following added features. First, following Plimpton [17], we distinguish between "owned" particles and "other" particles, those particles that are on neighboring processors and are part of the extended volume on any given
processor. For "other" particles, only the information needed to calculate forces is communicated to neighboring processors. Second, the QDPD technique is being applied to suspensions, so there are two types of particles, "free" particles and particles belonging to ellipsoids (the solid inclusions). A novel feature of this work is that we explicitly do not keep all particles
belonging to the same ellipsoid on the same processor. Since the largest ellipsoid that might be built can consist
of as much as 50 % of all particles, that would be difficult if not impossible to handle without serious
load-balancing implications. What we do is assign each particle a unique particle number when it is read in.
Each processor has the list of ellipsoid definitions consisting of lists of particles defined by these unique particle
numbers. Each processor computes solid inclusion properties for each particle it "owns", and these properties
are globally summed (using MPI_REDUCE [1, 18] over all processors so that all processors have the same solid inclusion properties. Since there are only a small
number of ellipsoids (relative to the number of particles), the amount of communication necessary for the
global sums is small and the amount of extra memory is also relatively small. Hence it is an efficient technique.


1 Certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.

2 See Plimpton [16] for excellent discussions of all fast parallel algorithms.


Next: Spatial Decomposition Program Details Up: Main Previous: Sequential Link-Cell Algorithm