I'm investigating some very large performance variation and have reduced
the issue to a very simple MPI_Allgather benchmark.  The variability
does not occur for serial jobs, but it does occur within single nodes.
I'm not at all convinced that this is an Open MPI-specific issue (in
fact the same variance is observed with MVAPICH2, which is an available,
but not "recommended", implementation on that cluster), but perhaps
someone here can suggest steps to track down the issue.

The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
with QDR InfiniBand.  The benchmark loops over

  MPI_Allgather(localdata, nlocal, MPI_DOUBLE, globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);

with nlocal=10000 (80 KB per process) 10000 times, so it normally runs in
a few seconds.  Open MPI 1.4.1 was compiled with gcc-4.3.3, and this
code was built with mpicc -O2.  All submissions used 8 processes; timing
and host results are presented below in chronological order.  The jobs
were run with 2-minute time limits (to get through the queue easily),
and jobs are marked "killed" if they went over that limit.  Jobs were
usually submitted in batches of 4.  The scheduler is LSF-7.0.
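
For concreteness, a minimal self-contained version of the benchmark looks
roughly like the following (the allocation, initialization, and MPI_Wtime
timing here are a sketch rather than the exact harness; nlocal and the
repetition count are the values quoted above):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
    const int nlocal = 10000, nreps = 10000;
    int size, rank, i, rep;
    double *localdata, *globaldata, tstart, tend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    localdata  = malloc(nlocal * sizeof(double));
    globaldata = malloc((size_t)size * nlocal * sizeof(double));
    for (i = 0; i < nlocal; i++) localdata[i] = rank + i;  /* arbitrary payload */

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before timing */
    tstart = MPI_Wtime();
    for (rep = 0; rep < nreps; rep++)
      MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                    globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    tend = MPI_Wtime();

    if (rank == 0)
      printf("%d reps of MPI_Allgather, %d doubles per rank: %e s\n",
             nreps, nlocal, tend - tstart);

    free(localdata);
    free(globaldata);
    MPI_Finalize();
    return 0;
  }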

The HOST field indicates the node that was actually used.  The a6* nodes
are of the type described above; the a2* nodes are much older (2-socket
Opteron 2220, dual core, 2.8 GHz) and use a Quadrics network, and the
timings are very reliable on those older nodes.  When the issue first
came up, I was inclined to blame memory-bandwidth contention with other
jobs, but the variance is still visible when our job occupies exactly one
full node, it is present regardless of affinity settings, and events that
don't require communication are well balanced in both small and large runs.
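
As a possible further diagnostic (a sketch only, reusing the names from the
benchmark sketch above), the time spent inside the Allgather itself can be
accumulated per rank and reduced to a min/max across ranks; balanced local
work together with a large spread here would point at the collective or
the transport rather than the computation:

  /* Replace the timed loop in the sketch above with a per-rank
     accumulation of time spent inside the collective. */
  double t0, tcoll = 0.0, tmin, tmax;
  for (rep = 0; rep < nreps; rep++) {
    t0 = MPI_Wtime();
    MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                  globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    tcoll += MPI_Wtime() - t0;
  }
  MPI_Reduce(&tcoll, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&tcoll, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("time in MPI_Allgather per rank: min %e s, max %e s\n", tmin, tmax);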

I then suspected possible contention between transport layers.  ompi_info
gives

  MCA btl: parameter "btl" (current value: "self,sm,openib,tcp", data source: environment)

so the timings below cover several ways of restricting this list.
Unfortunately, the variance is large for all combinations, but I find it
notable that -mca btl self,openib is reliably much slower than self,tcp.

Note that some nodes appear in multiple runs, yet there is no strict
pattern of particular nodes being "fast" or "slow": for instance, a6200 is
very slow (6x or more) in the first set, then normal in the subsequent
test.  Nevertheless, when the same node appears in temporally nearby
tests, there does seem to be a correlation (though there is certainly not
enough data here to establish that with confidence).

As a final observation, I think the performance is unreasonably low in
all cases: the same test on a 2-socket Opteron 2356 (quad core, 2.3 GHz)
unrelated to this cluster always takes between 9.75 and 10.0 seconds,
about 30% faster than the fastest observations on the cluster nodes, even
though those nodes have faster cores and memory.

#  JOB       TIME (s)      HOST

ompirun
lsf.o240562 killed       8*a6200
lsf.o240563 9.2110e+01   8*a6200
lsf.o240564 1.5638e+01   8*a6237
lsf.o240565 1.3873e+01   8*a6228

ompirun -mca btl self,sm
lsf.o240574 1.6916e+01   8*a6237
lsf.o240575 1.7456e+01   8*a6200
lsf.o240576 1.4183e+01   8*a6161
lsf.o240577 1.3254e+01   8*a6203
lsf.o240578 1.8848e+01   8*a6274

prun (quadrics)
lsf.o240602 1.6168e+01   4*a2108+4*a2109
lsf.o240603 1.6746e+01   4*a2110+4*a2111
lsf.o240604 1.6371e+01   4*a2108+4*a2109
lsf.o240606 1.6867e+01   4*a2110+4*a2111

ompirun -mca btl self,openib
lsf.o240776 3.1463e+01   8*a6203
lsf.o240777 3.0418e+01   8*a6264
lsf.o240778 3.1394e+01   8*a6203
lsf.o240779 3.5111e+01   8*a6274

ompirun -mca self,sm,openib
lsf.o240851 1.3848e+01   8*a6244
lsf.o240852 1.7362e+01   8*a6237
lsf.o240854 1.3266e+01   8*a6204
lsf.o240855 1.3423e+01   8*a6276

ompirun
lsf.o240858 1.4415e+01   8*a6244
lsf.o240859 1.5092e+01   8*a6237
lsf.o240860 1.3940e+01   8*a6204
lsf.o240861 1.5521e+01   8*a6276
lsf.o240903 1.3273e+01   8*a6234
lsf.o240904 1.6700e+01   8*a6206
lsf.o240905 1.4636e+01   8*a6269
lsf.o240906 1.5056e+01   8*a6234

ompirun -mca self,tcp
lsf.o240948 1.8504e+01   8*a6234
lsf.o240949 1.9317e+01   8*a6207
lsf.o240950 1.8964e+01   8*a6234
lsf.o240951 2.0764e+01   8*a6207

ompirun -mca btl self,sm,openib
lsf.o240998 1.3265e+01   8*a6269
lsf.o240999 1.2884e+01   8*a6269
lsf.o241000 1.3092e+01   8*a6234
lsf.o241001 1.3044e+01   8*a6269

ompirun -mca btl self,openib
lsf.o241013 3.1572e+01   8*a6229
lsf.o241014 3.0552e+01   8*a6234
lsf.o241015 3.1813e+01   8*a6229
lsf.o241016 3.2514e+01   8*a6252

ompirun -mca btl self,sm
lsf.o241044 1.3417e+01   8*a6234
lsf.o241045 killed       8*a6232
lsf.o241046 1.4626e+01   8*a6269
lsf.o241047 1.5060e+01   8*a6253
lsf.o241166 1.3179e+01   8*a6228
lsf.o241167 2.7759e+01   8*a6232
lsf.o241168 1.4224e+01   8*a6234
lsf.o241169 1.4825e+01   8*a6228
lsf.o241446 1.4896e+01   8*a6204
lsf.o241447 1.4960e+01   8*a6228
lsf.o241448 1.7622e+01   8*a6222
lsf.o241449 1.5112e+01   8*a6204

ompirun -mca btl self,tcp
lsf.o241556 1.9135e+01   8*a6204
lsf.o241557 2.4365e+01   8*a6261
lsf.o241558 4.2682e+01   8*a6214
lsf.o241560 2.0481e+01   8*a6262

ompirun -mca btl self,sm,openib
lsf.o241635 1.4234e+01   8*a6204
lsf.o241636 1.2024e+01   8*a6214
lsf.o241637 1.2773e+01   8*a6214
lsf.o241638 killed       8*a6214
lsf.o241684 1.8050e+01   8*a6261
lsf.o241686 1.3567e+01   8*a6203
lsf.o241687 1.5020e+01   8*a6228
lsf.o241688 2.2387e+01   8*a6225

ompirun -mca btl self,openib
lsf.o241723 3.0060e+01   8*a6228
lsf.o241724 3.4366e+01   8*a6244
lsf.o241725 3.0033e+01   8*a6203
lsf.o241726 3.0499e+01   8*a6228
lsf.o241741 3.0483e+01   8*a6234
lsf.o241743 6.9527e+01   8*a6225
lsf.o241744 3.0945e+01   8*a6244
lsf.o241745 3.2120e+01   8*a6220

mvapich2 1.4.1
lsf.o243902 1.3661e+01   8*a6243
lsf.o244832 2.9471e+01   8*a6250
lsf.o244833 2.8425e+01   8*a6250
lsf.o244835 1.3644e+01   8*a6261
lsf.o244837 1.3793e+01   8*a6244
lsf.o244838 2.6907e+01   8*a6250
lsf.o247496 1.3632e+01   8*a6244
lsf.o247497 1.3368e+01   8*a6244
lsf.o247499 1.4120e+01   8*a6252


Any suggestions?

Jed
