QDPD was originally written in Fortran 77 as a serial program. A lot of the formalism of the sequential link-cell algorithm relies heavily on the Allen and Tildesley book [9] and computer routines (such as MAPS and TOPMAP) discussed in the book and available on the Web [13]. We have retained the names of the Allen and Tildesley routines in our program and in the discussion. Routines discussed later in the text, such as EXTVOL, LEBC, and MOVPAR, are parallel routines and have no counterpart in Allen and Tildesley. To improve computational performance, a parallelization was done relatively quickly using a simplified version of the replicated data approach and the standard message passing interface library (MPI [1]), as described in Sims et. al. [14]. We reported speedups of as much as 17.5 times on 24 processors of a 32 processor shared memory SGI Origin 20001. When doing a calculation on multiple processors, the total run time can be represented as the sum of computation (cpu) time and communication time, viz.,
![]()
In the replicated data approach, as P (number of
processors) increases, tcpu goes down, but we still have
to communicate the same amount of information (proportional
to N (number of particles), so it doesn't
scale). Also distributed memory machines often do not
possess enough memory on a processing node to hold
all of the data for a large job. When the goal is to simulate
an extremely large system on a distributed-memory
computer to allow for the larger total memory of the
distributed-memory computer and also to take advantage
of a larger number of processors, a different
approach is needed. Since the link-cell algorithm we
used in the sequential and replicated data approaches
breaks the simulation space into domains, it seems natural
to map this geometrical, or domain, decomposition
onto separate processors. Doing so is the essence of the
parallel link-cell technique [11,
15]2. By
subdividing the
physical volume among processors, most of the computation
becomes local and the communication is minimized
so there is, in principle, an N / P scaling (N =
number of particles, P = number of processors), an efficient
approach for distributed-memory computers and
networks of workstations.The basic idea is this:
Split the total volume into P domains, where P is the
number of processors. If we choose a 1D decomposition
("slices of bread"), then the pth processor is
responsible for particles whose x-coordinates lie in the
range
![]()
Similar equations apply for 2D and 3D decompositions for simulation dimensions Ly and Lz. Whether the decomposition is 1D, 2D, or 3D depends on the number of processors. An algorithm due to Plimpton [17] is used to assign P processors to a 3D box so as to minimize the surface area (and hence, yield a good load balancing). For P processors and a given simulation box of dimensions Lx, Ly, and Lz, the algorithm is the following. Loop through all factorizations of P into Px, Py, and Pz processors, computing the area of the resulting box, and pick the one with the minimum surface area. Of multiple equal surface areas (for example, Px, Py, Pz = (4,2,2), (2,4,2), (2,2,4)), pick the one with Px ≤ Py ≤ Pz.
Each processor runs a link-cell program corresponding
to a particular domain of the simulation box. For
example, in Fig. 5 nine processors were used
to divide the 2D simulation space into domains, each processor
being assigned to one of the nine domains (here we
assign each processor an index, where the indices start
at 0). We also show the central processor's domain
being subdivided into cells. To complete the force calculation
on particles in cells at the interface between
processors, each processor needs to know information
about the particles in the adjacent cells, which now will
be found on a neighboring processor. To handle this
problem we construct an extra layer of cells on each
processor at the interface between processors. At each
timestep we communicate information across the interface
between adjacent processors describing the particles
in these edge cells (subroutine EXTVOL). The
information that has to be passed by EXTVOL is the
information needed for the forces calculations, which
is,
and the unique particle number discussed
below. For example, in Fig. 5 we show
processors 3, 4, and 5 and we also show, with dashed lines, the cells in
processors 3 and 5 which are adjacent to 4. Information
about the particles in these dashed line cells is communicated
to 4, making up "ghost" cells on 4. To complete
the "extended volume" needed on processor 4 to compute
the forces on all the particles it "owns", information
is communicated (swapped) across the interface
between adjacent processors in the Y direction as well.
To account for the cross-hatched corner cells, the swap
in Y includes information about not only particles that
the processor owns but also information about "other"
particles in "ghost" cells. So processor 7 sends information
about particles in the dashed line cells as well as
the cross-hatched cells (obtained from processors 6 and
8) to processor 4. At this point processor 4 has all the
information it needs to calculate forces on all the particles
it owns (processor 4 now has information about all
the particles shown in the extended volume comprised
of the domain of processor 4 plus the surrounding
dashed line ghost cells), and similarly all of the other
processors have all the information they need. These
exchanges of data can be achieved by one set of communications
between the processors. A processor only
has to communicate once with all of its neighbors, so
each processor communicates with at most four other
processors (six in 3D), rather than, say 64 in a 64
processor replicated data calculation. Now on each
processor, form a link-cell list of all particles in the
original volume plus the extended volume. Loop over
the particles in the original volume, calculating the
forces on them and their pair particle (for conservation
of momentum). Care must be taken to add these pair
particle forces on particles in the extended volume to
the forces on the pair particles in the processor "owning"
them, which necessitates an extra set of communications
between processors (the reverse of the communication
swaps setting up the "ghost" cells). This extra
communication step is necessary in the QDPD method
since the interparticle force calculation involves the use
of a random number for thermal effects and momentum
conservation requires that the same random number be
used in the equal and opposite force calculation.
Finally calculate the new positions of all particles and
move the particles which have left a processor to their
new home processor. If particles move into domains
controlled by other processors, information about the
particle (the particle’s properties) must be moved to its
new "home" processor. Again these exchanges of data
can be achieved by one set of communications between
the processors, and are implemented in subroutine MOVPAR. In this set of
communications, all information about a particle needed for one time step must
be communicated, not just the information needed for the forces calculation (the
information communicated in EXTVOL as explained above).
Our spatial decomposition program has the following
added features. First, following Plimpton [17], we
distinguish between "owned" particles and "other" particles, those particles
that are on neighboring processors and are part of the extended volume on any given
processor. For "other" particles, only the information needed to
calculate forces is communicated to neighboring processors. Second, the QDPD
technique is being applied to suspensions, so there are two types of particles,
"free" particles and particles belonging to ellipsoids (the solid
inclusions). A novel feature of this work is that we explicitly do not keep all particles
belonging to the same ellipsoid on the same processor. Since the largest ellipsoid that might be built can consist
of as much as 50 % of all particles, that would be difficult if not impossible to handle without serious
load-balancing implications. What we do is assign each particle a unique particle number when it is read in.
Each processor has the list of ellipsoid definitions consisting of lists of particles defined by these unique particle
numbers. Each processor computes solid inclusion properties for each particle it "owns", and these properties
are globally summed (using MPI_REDUCE [1, 18]
over all processors so that all processors have the same solid inclusion properties. Since there are only a small
number of ellipsoids (relative to the number of particles), the amount of communication necessary for the
global sums is small and the amount of extra memory is also relatively small. Hence it is an efficient technique.
1 Certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.
2 See Plimpton [16] for excellent discussions of all fast parallel algorithms.