In a parallel environment with a set of N processors, the program is typically set up such that one processor is arbitrarily selected as the root node (rank=0) and the others as workers (rank=1...N−1). For a given processor, P, we call processors with rank equal to P − 1 and P + 1, P's south and north neighbor, respectively. Root is in charge of the I/O, assigning data to the workers (using MPI) and does its share of calculating. It should be noted that the user does not actually assign the root or worker identity to any specific processor in the cluster, but the program makes the requirement that the conditions exist and it is the operating system's duty to carry it out. The user can always specify a processor to carry out a certain function in the code. See Section 4, MPI Primer for details on how each processor is uniquely identified within a program.
As mentioned previously, each matrix element can be calculated individually and independently from each other. So these parallel programs take advantage of the 3−D nature of the data (stored in array pix) by splitting it (along the z-direction) across multiple processing nodes. Each matrix element is addressed by a unique triplet of (x, y, z) coordinates and only portions (z-specific) of these large arrays exist on all the processors. This data is divided as evenly as possible over N compute nodes in the z direction. The number of data layers, Nk, each of the N processors receives per array is approximately . Therefore, the data model is to take a large array and treat it as nz 2-D arrays with dimensions of (nx, ny). Now each processing node only has to dedicate (Figure 1) the amount of memory to data storage for an equivalent sized problem on a serial machine. Theoretically, this calculation should speed up by a factor of N the amount of time to execute the same problem on a serial based machine. Additionally, problems which are N times as large can be run as well.
Figure 1: Depiction of Data Set split across 8 processing nodes.
Figure 2: Depiction of d1 and d2 values for Root and Processing Node 1.
Given the magnitude of nz and the number of processors, root calculates the number of data layers of pix each processing node receives. The lower and upper limits of the z extents for each processor are d1 and d2. Each node gets its own copy of the layers of pix from root; root stores a master copy of the d1 and d2 values for all the nodes in arrays d1s and d2s. Next, root passes the contiguous layers of the original data in which the value of pix's k indices lie (d1 ≤ k ≤ d2) on the proper processing node. The proper assignment of an array of this type using FORTRAN90 notation is: pix(nx, ny, d1 : d2). The last index in the form d1 : d2, is the range of the k values used. For a given processing node, these values are unique.
The inherent question after splitting the original data across a number of processing nodes is: Does a node have all the data it needs to carry out its assigned tasks? We know for these problems, which need nearest neighbor information, that they cannot have the required data after the initial split due to the user imposed boundaries on the dataset. Therefore inter-node communication (data transfer) is necessary. This requires the processors to know which nodes have the data they need and a mechanism for the data transfer.
Since a voxel needs information from its nearest neighbors to perform a correct calculation, problems arise when a processor attempts to calculate using a voxel located in either its top layer (z = d2) or its bottom layer (z = d1). Since this problem arises for all voxels in their respective d1 or d2 layers, a given node will need an entire data slice (one 2-d array) from its north and south neighbors, respectively. To be exact, processor P needs the south node (P − 1) to send its values of pix(i, j, d2) and the north node (P + 1) to send its values of pix(i, j; d1). In another notation, using the rank value as a subscript, processor P needs pix(i, j, d2)P − 1 and pix(i, j, d1)P + 1.
The preferred way for handling this situation is to increase the z-size of the array on each node by 2. The new layers occupy k = d1 − 1 and k = d2 + 1 per processor. They are referred to commonly as ghost layers (Figure 3). These layers are created before any of the calculations proceed since pix does not change during a calculation. This method allows the calculations to proceed uninterrupted unless global sums or other similar actions are called for. We define a new array called vox which is a copy of pix, but also containing the 2 extra data layers. It has no serial counterpart, but will function like pix from the serial code. It is dimensioned in FORTRAN90 as: vox(i, j, d1 − 1: d + 1). To emphasize its identity,
This makes the total amount of memory usage per node increase slightly. However, it obviates the need for additional inter-node communication during a given calculation that would increase the overall run time of the job. Also remember that one is gaining substantial memory savings compared to the serial version, so this cost is acceptable.
This situation gives rise to two special cases, namely: What is considered south of processor 0 and north of processor N − and north of process N − 1 is processor 0. This leads to the following assignments.
Figure 3: Depiction of Top and Bottom Ghost Layers from Root and Node 1. d2+1(Node1) = d1(Node2); d1-1(Node1) = d2(Root); d2+1(Root)= d2(Node1) but, d1-1(Root) = d2 layer of Nth processor.
In the serial code, memory allocation of the array pix and subsequent large arrays is handled in a static way by DIMENSION statements. In each statement, the user substitutes the numerical value of ns, where ns = nx x ny x nz, into the DIMENSION statement of each individual array. The parallel versions incorporate FORTRAN90's ALLOCATABLE and ALLOCATE statements. Dynamic memory allocation for the large arrays is based on the values of nx, ny, nz, d1 and d2. The user only needs to correctly enter the values for nx, ny and nz and the program handles the necessary allocation procedures. Assuming nx, ny, nz, d1 and d2 are already known, then examples of using ALLOCATE per node for arrays pix and vox are:
In this example, only root allocates memory for pix and all processors allocate memory space for its portion of vox. Root is the only processor which needs the entire pix array since it must pass out specific allotments to the workers. In conjunction with the DEALLOCATE statement, the memory used for pix is released after all passing of data is complete. Also vox is defined by its d1 and d2 limits and not the entire value, nz. This small range is at the heart of defining subsections of arrays per processor for parallel computations. Furthermore, this type of memory allocation used with the array vox is applied to all the large arrays found \throughout all the programs. See Table 1 for a description of the array dimensions per program.