Next: Conclusion Up: Main Previous: Relative permeability


Performance results

We ran a series of timing tests in an effort to understand how performance of our implementation scales on different computer architectures. We have tested on an SGI Onyx with 12 R10000 processors running at 196MHz, an IBM SP2 with 37 RS/6000 processors, most running at 66MHz. The same code and the same cases were run on the two systems. The results are presented in Tables 1 and 2. The performance reported was somewhat affected by other jobs that were running at the same time that the tests were being run, although efforts were made to minimize the effect.

# # Fluid Components
Processors 1 2 3
       
1 14.70 24.70 33.27
2 7.39 12.22 16.69
4 3.80 6.23 8.57
8 2.14 3.48 4.68
# # Fluid Components
Processors 1 2 3
       
1 38.48 62.36 99.93
2 19.30 31.32 51.16
4 10.44 16.83 26.97
8 6.86 10.01 15.54
16 4.37 6.00 8.30
   
Table 1. Execution times in seconds for Table 2. Execution time in seconds for
one iteration on the SGI Onyx. one iteration on the IBM SP2.

These data closely agree with a very simple model describing performance: T = P/N + S, where T is the total time for a single iteration, P is the time for the parallelizable computation, S is the time for the non-parallelizable computation, and N is the number of processors. The parallelizable computation is that portion of the processing that can be effectively distributed across the processors. The non-parallelizable computation includes processing that cannot be distributed; this includes time for inter-process communication as well as computation that must be performed either on a single processor, or must be done identically on all processors.

For example, the two-component fluid performance data for the SGI Onyx, closely match this formula: T = 4.78 + 487.26/N seconds, where N is the number of processors. Similarly, the timings for the two component runs on the IBM SP2 closely match: T = 41.67 + 1198.45/N seconds. Formulae for the other cases are easily derived. Figures 6 and 7 present these results graphically.


\begin{figure}
\par\begin{center}
\setlength{\tabcolsep}{15pt}
\begin{tabular}{...
...GI Onyx.
&
iteration on the IBM SP2.
\end{tabular}\end{center}\par\end{figure}

Much of the difference between the performance of these two systems is likely due simply to the relative computational speeds of each processor. But the difference in the serial overhead (4.78 seconds on the SGI versus 41.67 seconds on the IBM), is most likely due to the different memory architectures of the two systems. The SGI Onyx uses a Non-Uniform Memory Access (NUMA) architecture that enables processes to pass data to one another through shared memory, However, on the IBM SP2 no memory is shared and data must be transferred over an external high-speed network. Thus the overhead for message passing on the SGI Onyx is considerably lower than that on the IBM SP2. We intend to run timing tests to measure the difference in message passing overhead.

The time for the parallelizable portion of the code is expected to be in proportion to the number of active sites, which depends on the porosity and the size of the volume. But the time for the non-parallelizable portion of the code is likely to be dominated by the inter-process communication. Assuming that communication time is roughly proportional to the amount of data transferred, the communication time should be proportional to the number of active sites on an XY plane.

So as we process larger systems, the time for the parallelizable portion of the code should increase proportionally with the cube of the linear size of the system, while the non-parallelizable portion should increase with the square of the linear size of the system. This means that for larger systems, a larger proportion of the time is in the parallelizable computation, and greater benefits can be derived from running on multiple processors. We are still in the process of investigating the scaling of the software's performance with system size.

These performance data give us a general idea of how long it takes to get practical results for real-world problems on the computing platforms tested. For example, a typical case requires about 10000 iterations to converge. So from the performance described above, a one-component run of the sample size and porosity (22 %) described above will take about 41 hours on one processor on an SGI Onyx. On four processors, the same run will take approximately 10.6 hours. Approximate times for other sizes and porosities are easily calculated from the data above.


Next: Conclusion Up: Main Previous: Relative permeability