FWIW, we have made some improvements to shared memory performance in the upcoming v1.3 series. I won't ask you to test a v1.3 tarball right now because there's a gnarly bug in the shared memory support that George is working to fix -- hopefully he'll fix it soon and you can see if the performance is a bit better in v1.3.

On Aug 13, 2008, at 3:52 AM, Lenny Verkhovsky wrote:

Hi,

Just to try it: can you run with -np 2?

(The PingPong test uses only 2 processes.)
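For example (assuming a standard Intel MPI Benchmarks build, where the
binary is named IMB-MPI1):

    mpirun -np 2 ./IMB-MPI1 PingPong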


On 8/13/08, Daniël Mantione <daniel.manti...@clustervision.com> wrote:

On Tue, 12 Aug 2008, Gus Correa wrote:

> Hello Daniel and list
>
> Could it be a problem with memory bandwidth / contention in multi-core?


Yes, I believe we are somehow limited by memory performance. Here are
some numbers from a dual Opteron 2352 system, which has much more memory
bandwidth:


#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec

            0         1000         0.86         0.00
            1         1000         0.97         0.98
            2         1000         0.95         2.01
            4         1000         0.96         3.97
            8         1000         0.95         7.99
           16         1000         0.96        15.85
           32         1000         0.99        30.69
           64         1000         0.97        63.09
          128         1000         1.02       119.68
          256         1000         1.18       207.25
          512         1000         1.40       348.77
         1024         1000         1.75       556.75
         2048         1000         2.59       753.22
         4096         1000         5.10       766.23
         8192         1000         7.93       985.13
        16384         1000        14.60      1070.57
        32768         1000        27.92      1119.23
        65536          640        46.67      1339.16
       131072          320        86.03      1453.06
       262144          160       163.16      1532.21
       524288           80       310.01      1612.88
      1048576           40       730.62      1368.69
      2097152           20      1449.72      1379.57
      4194304           10      2884.90      1386.53
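
For anyone who wants to reproduce this without installing IMB, here is a
minimal ping-pong sketch in C. This is illustrative only -- the buffer
size, repetition count, and the decimal-MB/s arithmetic are my own
choices, not IMB's:

    /* pingpong.c -- minimal ping-pong sketch, not the actual IMB code.
       Compile:  mpicc pingpong.c -o pingpong
       Run:      mpirun -np 2 ./pingpong                               */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1048576;        /* 1 MiB messages          */
        const int reps   = 40;             /* round trips to time     */
        char *buf = malloc(nbytes);
        int rank, i;
        double t0, t1, usec;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {         /* rank 0: send, wait for echo   */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {  /* rank 1: receive, echo back    */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {
            usec = (t1 - t0) * 1e6 / (2.0 * reps);   /* one-way time  */
            printf("%d bytes  %10.2f usec  %10.2f MB/s\n",
                   nbytes, usec, nbytes / usec);     /* bytes/usec == decimal MB/s */
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }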

However, +/- 1200 MB/s (or +/- 1500 MB/s in the case of the AMD system) is
not even close to the memory performance limits of these systems, so there
should be room for optimization.

After all, the openib btl manages to transfer the data from the memory of
one process to the memory of another process just fine, with better
performance.
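
If it helps, Open MPI lets you force a particular transport at run time,
so the two paths can be compared directly on one node. Assuming a
v1.2-era installation (ompi_info lists the components actually present):

    mpirun --mca btl self,sm     -np 2 ./IMB-MPI1 PingPong
    mpirun --mca btl self,openib -np 2 ./IMB-MPI1 PingPong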


> It has been reported in many mailing lists (mpich, beowulf, etc).
> Here it seems to happen on dual-processor, dual-core machines with our
> memory-intensive programs.


MPICH2 manages to get about 5 GB/s of shared memory bandwidth on the
Xeon 5420 system.


> Have you checked what happens to the shared memory runs as you
> increase the number of active cores/processes?
> Would it help to set the processor affinity in the shared memory runs?
>
> http://www.open-mpi.org/faq/?category=building#build-paffinity
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity


Neither has any effect on the scores.
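
For the record, this is how I enabled affinity at run time, per the
second FAQ entry above (assuming I recall the v1.2-era parameter name
correctly):

    mpirun --mca mpi_paffinity_alone 1 -np 2 ./IMB-MPI1 PingPong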


Daniël


--
Jeff Squyres
Cisco Systems

