We ran a series of timing tests to understand how the performance of our implementation scales on different computer architectures. We tested on an SGI Onyx with 12 R10000 processors running at 196 MHz, and on an IBM SP2 with 37 RS/6000 processors, most running at 66 MHz. The same code and the same cases were run on the two systems. The results are presented in Tables I and II. The reported performance was somewhat affected by other jobs running concurrently with the tests, although efforts were made to minimize this effect.
These data closely agree with a very simple model describing performance: T = P/N + S, where T is the total time for a single iteration, P is the time for the parallelizable computation, S is the time for the non-parallelizable computation, and N is the number of processors. The parallelizable computation is that portion of the processing that can be effectively distributed across the processors. The non-parallelizable computation includes processing that cannot be distributed; this includes time for inter-process communication as well as computation that must be performed either on a single processor, or must be done identically on all processors.
For example, the two-component fluid performance data for the SGI Onyx closely match this formula: T = 4.78 + 487.26/N s, where N is the number of processors. Similarly, the timings for the two-component runs on the IBM SP2 closely match: T = 41.67 + 1198.45/N s. Formulae for the other cases are easily derived. Figures 9 and 10 present these results graphically.
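The constants S and P in such formulas can be recovered from measured timings by ordinary linear least squares, treating 1/N as the independent variable. The sketch below illustrates this; the sample timings are synthetic, generated from the fitted SGI Onyx two-component formula rather than taken from the measured data.

```python
# Least-squares fit of the timing model T = S + P/N, where T is the
# time per iteration, S the non-parallelizable (serial) time, P the
# parallelizable time, and N the number of processors.

def fit_serial_parallel(timings):
    """timings: list of (N, T) pairs.
    Returns (S, P) from a linear least-squares fit of T against 1/N."""
    xs = [1.0 / n for n, _ in timings]
    ys = [t for _, t in timings]
    m = len(timings)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    P = sxy / sxx            # slope: parallelizable time
    S = y_mean - P * x_mean  # intercept: serial time
    return S, P

# Synthetic timings from the SGI Onyx formula T = 4.78 + 487.26/N s:
data = [(n, 4.78 + 487.26 / n) for n in (1, 2, 4, 8, 12)]
S, P = fit_serial_parallel(data)
# Noise-free data, so the fit recovers S ~ 4.78 and P ~ 487.26.
```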
Figure 9: Time in seconds for one iteration on the SGI Onyx.
Figure 10: Time in seconds for one iteration on the IBM SP2.
Much of the difference between the performance of these two systems is likely due simply to the relative computational speeds of each processor. But the difference in the serial overhead (4.78 s on the SGI versus 41.67 s on the IBM) is most likely due to the different memory architectures of the two systems. The SGI Onyx uses a Non-Uniform Memory Access (NUMA) architecture that enables processes to pass data to one another through shared memory. On the IBM SP2, by contrast, no memory is shared and data must be transferred over an external high-speed network. Thus the overhead for message passing on the SGI Onyx is considerably lower than that on the IBM SP2.
The time for the parallelizable portion of the code is expected to be proportional to the number of active sites, which depends on the porosity and the size of the volume. The time for the non-parallelizable portion of the code, however, is likely to be dominated by the inter-process communication. Assuming that communication time is roughly proportional to the amount of data transferred, the communication time should be proportional to the number of active sites on an XY plane.
So as we process larger systems, the time for the parallelizable portion of the code should increase proportionally with the cube of the linear size of the system, while the non-parallelizable portion should increase with the square of the linear size of the system. This means that for larger systems, a larger proportion of the time is in the parallelizable computation, and greater benefits can be derived from running on multiple processors.
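This scaling argument can be made concrete with a small numerical sketch. Taking the parallelizable time to grow as the cube of the linear size L and the non-parallelizable time as the square of L, the parallel fraction approaches unity and the multi-processor speedup improves as L grows. The proportionality constants a and b below are hypothetical, chosen only to illustrate the trend.

```python
# Scaling sketch: parallelizable work ~ a*L**3 (volume of active sites),
# non-parallelizable work ~ b*L**2 (communication over an XY plane).
# Constants a and b are hypothetical, for illustration only.

def parallel_fraction(L, a=1.0, b=5.0):
    """Fraction of single-processor time spent in parallelizable work."""
    P = a * L ** 3
    S = b * L ** 2
    return P / (P + S)

def speedup(L, N, a=1.0, b=5.0):
    """Speedup on N processors: T(1) / T(N) for T = S + P/N."""
    P = a * L ** 3
    S = b * L ** 2
    return (P + S) / (P / N + S)

for L in (32, 64, 128, 256):
    print(L, round(parallel_fraction(L), 3), round(speedup(L, 8), 2))
```

With these constants, both the parallel fraction and the eight-processor speedup increase monotonically with L, matching the qualitative claim above.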
These performance data give us a general idea of how long it takes to get practical results for real-world problems on the computing platforms tested. For example, a typical case requires about 10000 iterations to converge. So from the performance described above, a one-component run of the sample size and porosity (22%) described above will take about 41 h on one processor on an SGI Onyx. On four processors, the same run will take approximately 10.6 h. Approximate times for other sizes and porosities are easily calculated from the data above.
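The wall-clock estimates above follow directly from the per-iteration model: total time is simply iterations × (S + P/N). The per-iteration S and P values below are back-solved from the one-component figures quoted above (about 41 h on one processor and 10.6 h on four, at 10000 iterations); they are illustrative approximations, not measured values.

```python
# Wall-clock estimate for a full run from the per-iteration model
# T = S + P/N, converting total seconds to hours.

def run_hours(iterations, S, P, N):
    """iterations: number of iterations to convergence;
    S, P: serial and parallelizable seconds per iteration;
    N: number of processors."""
    return iterations * (S + P / N) / 3600.0

# Approximate per-iteration values back-solved from the one-component
# SGI Onyx figures quoted in the text (hypothetical, for illustration):
S, P = 0.17, 14.59
print(run_hours(10000, S, P, 1))  # about 41 h
print(run_hours(10000, S, P, 4))  # about 10.6 h
```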