Figure 10 shows the performance of our codes on two distributed memory architectures. In the figure we plot normalized processing time, which is the ratio of the time to complete a benchmark run on multiple processors divided by the time to compute a benchmark run on a single processor.

Fig. 10. Logarithm (base e) of normalized CPU time (seconds) versus number of processors. The performance of the replicated data version degrades much more quickly than the spatial decomposition version of the same code.
For the replicated data version of our code, the best we could do was a factor of 4.3 improvement on 16 processors on a Linux cluster with Myrinet. In comparison, the spatial decomposition version of the code, running on the same Linux cluster showed a greatly enhanced performance (a factor of 10.5 on 16 processors). The best results, for the spatial decomposition version, show a speed up of a factor of 24 on 27 200MHz Power3 processors on an IBM SP2, a distributed memory cluster, but with a high-speed interconnect which allows it to approach the scalability of a shared memory machine in many cases.
Our spatial decomposition code has proven effective in a shared memory environment [14] as well, where the speedups are a factor of 29 on 32 processors of an SGI Origin 3000 system and a factor of 50 on 64 processors of the same system. In contrast, for the replicated data parallelization, speedups are a factor of 17.5 on 24 processors of an SGI Origin 3000 [14]. Clearly, communication costs quickly become prohibitive for replicated data parallelizations on distributed memory architectures. Scaling to a very large number of processors is poor even in the shared memory environment, and it makes the replicated data approach almost unusable on distributed memory machines including those with high-speed interconnects like the IBM SP2 cluster.