There are likely many issues going on here:

- large all to all operations are very stressful on the network, even if you have very low latency / high bandwidth networking such as DDR IB

- if you only have 1 IB HCA in a machine with 8 cores, the problem becomes even more difficult because all 8 of your MPI processes will be hammering the HCA with read and write requests; it's a simple I/O resource contention issue

- there are several different algorithms in Open MPI for performing alltoall, but they were not tuned for ppn>4 (honestly, they were tuned for ppn=1, but they still usually work "well enough" for ppn<=4). In Open MPI v1.3, we introduce the "hierarch" collective module, which should greatly help with ppn>4 kinds of scenarios for collectives (including, at least to some degree, all to all). See the example commands after this list for one way to experiment with the existing alltoall algorithms.

- per the "sm" thread, you might want to try with just IB (and not shared memory), just to see if that helps (I don't expect that it will, but every situation is different). Try running "mpirun --mca btl openib ..." (vs. "--mca btl ^tcp").
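
For example (untested on your setup, and borrowing the hosts file and IMB binary from your command below):

  mpirun --mca btl openib,self -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall

And if your build exposes the tuned collective component's parameters ("ompi_info --param coll tuned" will list them), you can force a specific alltoall algorithm rather than letting Open MPI pick one:

  mpirun --mca btl openib,self \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_alltoall_algorithm 2 \
      -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall

Setting coll_tuned_alltoall_algorithm to 0 restores the default decision logic; the other values (2 above is just an example) each select a different implementation, so it may be worth sweeping through them.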



On Aug 15, 2008, at 5:00 PM, Kozin, I (Igor) wrote:

Hello,
I would really appreciate any advice on troubleshooting/tuning Open MPI over ConnectX. More details about our setup can be found here: http://www.cse.scitech.ac.uk/disco/database/search-machine.php?MID=52

Single process per node (ppn=1) seems to be fine (the results for IMB can be found here: http://www.cse.scitech.ac.uk/disco/database/search-pmb.php). However, there is a problem with Alltoall and ppn=8:

mpiexec --mca btl ^tcp -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall
      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           0         1000         0.01         0.02         0.01
           1         1000        95.70        95.87        95.81
           2         1000       107.59       107.64       107.62
           4         1000       108.46       108.52       108.49
           8         1000       112.25       112.30       112.28
          16         1000       121.07       121.12       121.10
          32         1000       154.12       154.18       154.15
          64         1000       207.85       207.93       207.89
         128         1000       334.52       334.63       334.58
         256         1000      9303.66      9305.98      9304.99
         512         1000      8953.59      8955.71      8955.08
        1024         1000      8607.87      8608.78      8608.42
        2048         1000      8642.59      8643.30      8643.03
        4096         1000      8478.45      8478.64      8478.58

I’ve tried playing with various parameters but to no avail. The step up for the same message size is noticeable for n=64 and 32 as well, but progressively less so. Even more surprising is the fact that Gigabit performs better at these message sizes:

mpiexec --mca btl self,sm,tcp --mca btl_tcp_if_include eth1 -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall

      #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           8         1000       598.66       599.11       598.95
          16         1000       723.07       723.48       723.29
          32         1000      1144.79      1145.46      1145.18
          64         1000      1850.25      1850.97      1850.66
         128         1000      3794.32      3795.23      3794.82
         256         1000      5653.55      5653.97      5653.81
         512         1000      7107.96      7109.90      7109.66
        1024         1000     10310.53     10315.90     10315.63
        2048         1000    350066.92    350152.90    350091.89
        4096         1000     42238.60     42239.53     42239.27
        8192         1000    112781.11    112782.55    112782.10
       16384         1000   2450606.75   2450625.01   2450617.86
Unfortunately this task never completes…

Thanks in advance. Sorry for the long post.
Igor

PS I’m following the discussion on the slow sm btl, but I’m not sure whether this particular problem is related. BTW, the Open MPI build I’m using was built with the Intel compiler.

PPS MVAPICH and MVAPICH2 behave much better, but they are not perfect either; unfortunately I have other problems with them.


I. Kozin  (i.kozin at dl.ac.uk)
STFC Daresbury Laboratory, WA4 4AD, UK
http://www.cse.clrc.ac.uk/disco




--
Jeff Squyres
Cisco Systems

