On Tue, 12 Aug 2008, Gus Correa wrote:
> Hello Daniel and list
>
> Could it be a problem with memory bandwidth / contention in multi-core?

Yes, I believe we are somehow limited by memory performance. Here are
some numbers from a dual Opteron 2352 system, which has much more memory
bandwidth:

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.86         0.00
            1         1000         0.97         0.98
            2         1000         0.95         2.01
            4         1000         0.96         3.97
            8         1000         0.95         7.99
           16         1000         0.96        15.85
           32         1000         0.99        30.69
           64         1000         0.97        63.09
          128         1000         1.02       119.68
          256         1000         1.18       207.25
          512         1000         1.40       348.77
         1024         1000         1.75       556.75
         2048         1000         2.59       753.22
         4096         1000         5.10       766.23
         8192         1000         7.93       985.13
        16384         1000        14.60      1070.57
        32768         1000        27.92      1119.23
        65536          640        46.67      1339.16
       131072          320        86.03      1453.06
       262144          160       163.16      1532.21
       524288           80       310.01      1612.88
      1048576           40       730.62      1368.69
      2097152           20      1449.72      1379.57
      4194304           10      2884.90      1386.53

However, +/- 1200 MB/s (or +/- 1500 MB/s in the case of the AMD system)
is not even close to the memory performance limits of these systems, so
there should be room for optimization. After all, the openib btl manages
to transfer the data from the memory of one process to the memory of
another process just fine, at higher performance.

> It has been reported in many mailing lists (mpich, beowulf, etc).
> Here it seems to happen in dual-processor dual-core with our memory
> intensive programs.

MPICH2 manages to get about 5 GB/s in shared memory performance on the
Xeon 5420 system.

> Have you checked what happens to the shared memory runs as you
> increase the number of active cores/processes?
> Would it help to set the processor affinity in the shared memory runs?
>
> http://www.open-mpi.org/faq/?category=building#build-paffinity
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Neither has any effect on the scores.

Daniël
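
[Editor's note: for readers unfamiliar with how the t[usec] and Mbytes/sec
columns above are produced, here is a minimal ping-pong sketch in C. It is
not the IMB source; the message size and repetition count are assumptions
chosen to mirror one row of the table. Rank 0 bounces a buffer off rank 1,
and the one-way time is taken as half the averaged round-trip time.]

/* minimal ping-pong sketch (assumed parameters, not the IMB code) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i;
    const int reps  = 1000;        /* matches the #repetitions column  */
    const int bytes = 1048576;     /* 1 MiB, one of the sizes above    */
    char *buf = malloc(bytes);
    double t0, t_oneway;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {           /* send, then wait for the echo     */
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {    /* echo the message back            */
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    /* one-way latency = round-trip time / 2, averaged over reps */
    t_oneway = (MPI_Wtime() - t0) / (2.0 * reps);

    if (rank == 0)
        printf("%d bytes: %.2f usec, %.2f MB/s\n",
               bytes, t_oneway * 1e6, bytes / t_oneway / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}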
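
[Editor's note: the FAQ links quoted above describe Open MPI's own
processor-affinity support, which is enabled through MCA parameters rather
than application code. Purely as an illustration of what "pinning" a rank
to a core means at the OS level, here is a hedged, Linux-specific sketch;
the rank-to-core mapping is an assumption and this is not Open MPI's
paffinity mechanism.]

/* illustrative only: pin rank N to core N on Linux */
#define _GNU_SOURCE
#include <sched.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* keep each ping-pong process on a fixed core so it does not
     * migrate between sockets during the measurement */
    CPU_ZERO(&mask);
    CPU_SET(rank, &mask);          /* assumed: core IDs 0..N-1 exist  */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... run the ping-pong loop here ... */

    MPI_Finalize();
    return 0;
}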