There are likely many issues going on here:
- large all-to-all operations are very stressful on the network, even
if you have very low latency / high bandwidth networking such as DDR IB
- if you only have 1 IB HCA in a machine with 8 cores, the problem
becomes even more difficult because all 8 of your MPI processes will
be hammering the HCA with read and write requests; it's a simple I/O
resource contention issue
- there are several different algorithms in Open MPI for performing
alltoall, but they were not tuned for ppn>4 (honestly, they were tuned
for ppn=1, but they still usually work "well enough" for ppn<=4). In
Open MPI v1.3, we are introducing the "hierarch" collective module,
which should greatly help with ppn>4 kinds of scenarios for collectives
(including, at least to some degree, all-to-all); see the second
command sketch after this list for one way to experiment with the
current algorithm selection
- per the "sm" thread, you might want to try with just IB (and not
shared memory), just to see if that helps (I don't expect that it
will, but every situation is different). Try running "mpirun --mca
btl openib ..." (vs. "--mca btl ^tcp").
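For concreteness, here is a sketch of the two runs, reusing the
machinefile and IMB binary from the original post (the BTL names assume
the usual openib/sm/self components of the 1.2/1.3 series):

  # IB only: openib plus the "self" BTL (needed for a rank to send to itself)
  mpirun --mca btl openib,self -machinefile hosts32x8.txt -n 128 \
      src/IMB-MPI1.openmpi -npmin 128 Alltoall

  # For comparison, the original run only excluded tcp, so sm and openib
  # were both in use
  mpirun --mca btl ^tcp -machinefile hosts32x8.txt -n 128 \
      src/IMB-MPI1.openmpi -npmin 128 Alltoall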
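Separately, the "tuned" collective component exposes MCA parameters for
forcing a particular alltoall algorithm, which can help narrow down
whether the algorithm choice (rather than the fabric) is the problem.
A minimal sketch, assuming the coll_tuned parameter names of the
1.2/1.3 series:

  # List the tunable alltoall parameters (names can differ between releases)
  ompi_info --param coll tuned | grep alltoall

  # Force one specific alltoall algorithm (0 means "let Open MPI decide")
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_alltoall_algorithm 2 \
      --mca btl openib,self -machinefile hosts32x8.txt -n 128 \
      src/IMB-MPI1.openmpi -npmin 128 Alltoall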
On Aug 15, 2008, at 5:00 PM, Kozin, I (Igor) wrote:
Hello,
I would really appreciate any advice on troubleshooting/tuning Open
MPI over ConnectX. More details about our setup can be found here http://www.cse.scitech.ac.uk/disco/database/search-machine.php?MID=52
Single process per node (ppn=1) seems to be fine (the results for
IMB can be found here http://www.cse.scitech.ac.uk/disco/database/search-pmb.php)
However, there is a problem with Alltoall at ppn=8:
mpiexec --mca btl ^tcp -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.01         0.02         0.01
            1         1000        95.70        95.87        95.81
            2         1000       107.59       107.64       107.62
            4         1000       108.46       108.52       108.49
            8         1000       112.25       112.30       112.28
           16         1000       121.07       121.12       121.10
           32         1000       154.12       154.18       154.15
           64         1000       207.85       207.93       207.89
          128         1000       334.52       334.63       334.58
          256         1000      9303.66      9305.98      9304.99
          512         1000      8953.59      8955.71      8955.08
         1024         1000      8607.87      8608.78      8608.42
         2048         1000      8642.59      8643.30      8643.03
         4096         1000      8478.45      8478.64      8478.58
I've tried playing with various parameters, but to no avail. The jump
at the same message size is noticeable for n=64 and n=32 as well,
though progressively less so. Even more surprising is the fact that
Gigabit Ethernet performs better at this message size:
mpiexec --mca btl self,sm,tcp --mca btl_tcp_if_include eth1 -machinefile hosts32x8.txt -n 128 src/IMB-MPI1.openmpi -npmin 128 Alltoall
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            8         1000       598.66       599.11       598.95
           16         1000       723.07       723.48       723.29
           32         1000      1144.79      1145.46      1145.18
           64         1000      1850.25      1850.97      1850.66
          128         1000      3794.32      3795.23      3794.82
          256         1000      5653.55      5653.97      5653.81
          512         1000      7107.96      7109.90      7109.66
         1024         1000     10310.53     10315.90     10315.63
         2048         1000    350066.92    350152.90    350091.89
         4096         1000     42238.60     42239.53     42239.27
         8192         1000    112781.11    112782.55    112782.10
        16384         1000   2450606.75   2450625.01   2450617.86
Unfortunately, this task never completes…
Thanks in advance. Sorry for the long post.
Igor
PS I'm following the discussion on the slow sm BTL but am not sure
whether this particular problem is related. BTW, the Open MPI build
I'm using was built with the Intel compiler.
PPS MVAPICH and MVAPICH2 behave much better, though not perfectly
either. Unfortunately, I have other problems with them.
I. Kozin (i.kozin at dl.ac.uk)
STFC Daresbury Laboratory, WA4 4AD, UK
http://www.cse.clrc.ac.uk/disco
--
Jeff Squyres
Cisco Systems