generic form of this subroutine is used to calculate
The term, C, and array, b, are calculated in essentially the same manner as in the serial code; there are contributions to each from the x,y,z faces, the xy, xz, yz edges and the corners of the original microstructure dataset. But C and b are per-processor values in the parallel code. This means that each processor will calculate its own contribution, based on its chunk of data, to the overall value. Once a node determines its contribution, all partial results are sent to the master node who adds them together and passes back the result of this operation and broadcasts it to all nodes.
The nominal loops to calculate the positional contributions to C and b are closely related to the serial case. One usually loops over 2 of the 3 Cartesian coordinates, (x,y,z). In the code, a loop in the x direction uses the variable i as its index, viz: i = 1, nx. Similar arrangements are made for the y and z directions using the variables j and k, respectively as in j = 1, ny and k = 1, nz. Some k-type loops in the subroutine (z-direction) use the limits k = 1, nz − 1, but it is important to be mindful that the data, as well as other large arrays, are split across the processing nodes in the z-direction. Additional accommodations and precautions have to be made for this fact, but at the same time the code must be generic enough so all processors can execute it. It is especially important for the k = 1, nz − 1 loops.
This is accomplished by introducing a new variable, dn, in concert with an if statement. dn is initially assigned to d2, but if d2 equals nz, then it takes on the value nz − 1. Note that this only occurs at the processor with the highest rank value. So during these calculations, all levels on the lower ranks are included and the d2 is excluded only on the highest rank.
Another interesting artifact of this calculation occurs when calculating the contributions for C and b from the Z-face, x=nx z=nz edge, y=ny z=nz edge and x=nx y=ny z=nz corner. In these 4 cases, only the highest rank processor is needed since this is the one where k has values of nz and nz − 1.
The accuracy of C is increased in this algorithm. At each instance when the contribution of C is calculated (called cterm in the code), it is compared to zero (0). A positive cterm is added to cpos. Likewise a negative cterm is added to cneg. At the end of the calculation, cpos and cneg are summed on a node, which yields the overall C per processor. This prevents adding very large numbers of one sign with very small numbers of the opposite sign.
As mentioned, the elements of array b are calculated with a per-processor methodology as well. In the original serial codes, ELECFEM3D.f and ELAS3D.f, the elements of b are calculated as b(ib(m, is(mm))) and b(ib(m, is(mm)); nn), respectively. Notice the dependence on the hash table array, ib. In the parallel versions, the need for the hash table ib is eliminated and b is an array of rank 3 or 4, respectively.
But a calculation for an element of b is more complicated than generating a constant term like C. One must notice that the corresponding loop variables (i, j, k) are not the same as the indices of b, namely (ipx, ipy, ipz) which is currently being calculated. These indices can be thought of as a set of influenced positional values. Therefore, a calculation using vox(i, j, k) will influence the values of 27 separate and distinct elements of b, which correspond to its 27 nearest neighbors. In other words, a contribution to an element of b comes from 27 distinct elements of vox. The subroutine ipxyz is used to generate the values of an influenced triplet (ipx, ipy, ipz) using a given set of parameters (i, j, k and mm). So one saves on the overall amount of memory for this calculation by eliminating ib, but the price to be paid is executing ipxyz multiple times. After the distinct faces, edges and corner loops are finished, b is essentially complete. However no one node has the final values of b(i, j, d1) and b(l, m, d2). Part of the results are found on contiguous nodes, i.e. processor P − 1 and P + 1, respectively. Therefore, it is necessary to pass the bottom ghost layer of processor P + 1 to P then add to the d2 level of P once it is at the requesting node. This addition completes the calculation of b. A similar set of calculations involving the top layer of P − 1 and the d1 level of P are also required. In a succinct manner, it can be expressed as:
where capital B represents the final value of the element and lower-case b's represent the partial results.
At this point b is completely determined and updating the top and bottom ghost layers is the final step. This is accomplished by another use of subroutines b2t_dp and t2b_dp.
THERMAL3D_MPI.f uses a larger and slightly more complex form of FEMAT_MPI. Remember that THERMAL3D_MPI.f incorporates 1 additional large array, T. T is a linear term in displacements (like b) but arises from thermal strains and the constant term Y is similar to C, but arising from the thermal strains, not the applied or macrostrains. Since the overall system strains (macrostrains) are dynamic variables in the thermoelastic problem, the value of C will change throughout the operation of the program. However, the value of Y is constant, since it is a function only of the elastic moduli and the thermal strains, which are system variables and thus do not change. In the serial case, the array T was defined differently than b was in ELAS3D. Instead of being defined as T(ns, 3), it was defined as T(ns + 2, 3). This additional expansion of dimensions is handled by making 2 arrays for the parallel counterpart. The first is to dimension T as T(nx, ny, d1 − 1 : d2 + 1,3) (like b in ELAS3D MPI.f) but the 6 extra terms of T(ns + 1 : ns + 2, 3) are put into a new array called T2 and dimensioned T2(2, 3). Therefore, the generic array in THERMAL3D_MPI.f, X(ns + 2; 3) is converted to 2 arrays, one with the standard parallel dimensions of X(nx, ny, d1 − 1 : d2 + 1) and then also X2(2,3).
Calculations for T elements are handled in a similar manner as b from ELAS3D_MPI.f. Again, the (i, j, k) indices of the loop do not correspond to the element in question. The program also uses subroutine ipxyz and calculates a set of influenced positional parameters as well. To complete the calculation for T, implement equations similar to (16) and (17) above and do the final updating of the ghost layers.
Although the T2 array is small (2,3), it has contributions from the faces, edges and corner like the array b in ELAS3D_MPI.f. At the end of determining these per processor arrays, one must also form the element-by-element sum to end up with the nal version of T2. This is accomplished by using a call to MPI_ALLREDUCE:
![]()

t2temp is a temporary variable used to store the global sum of the per-processor T2(ipp, jpp) values. Then it reassigns T2 (ipp, jpp) with the global sum before it ends. After calculating T2, this implementation also has to calculate an array b. This term is linear in displacements as well and is generated identically to b in ELAS3D_MPI.f.