I'm investigating some very large performance variation and have reduced the issue to a very simple MPI_Allgather benchmark. The variability does not occur for serial jobs, but it does occur within single nodes. I'm not at all convinced that this is an Open MPI-specific issue (in fact the same variance is observed with MVAPICH2, which is an available, but not "recommended", implementation on that cluster), but perhaps someone here can suggest steps to track down the issue.
The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected with QDR InfiniBand. The benchmark loops 10000 times over

  MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);

with nlocal=10000 (80 kB contributed per process), so it normally runs in a matter of seconds. Open MPI 1.4.1 was compiled with gcc-4.3.3, and this code was built with mpicc -O2. All submissions used 8 processes; timing and host results are presented below in chronological order. The jobs were run with 2-minute time limits (to get through the queue easily); jobs that exceeded this limit are marked "killed". Jobs were usually submitted in batches of 4. The scheduler is LSF 7.0. The HOST field indicates the node that was actually used: a6* nodes are of the type described above, while a2* nodes are much older (2-socket Opteron 2220, dual core, 2.8 GHz) and use a Quadrics network; the timings are very reliable on these older nodes.

When the issue first came up, I was inclined to blame memory bandwidth contention with other jobs, but the variance is still visible when our job occupies exactly a full node, it is present regardless of affinity settings, and events that don't require communication are well balanced in both small and large runs. I then suspected possible contention between transport layers; ompi_info gives

  MCA btl: parameter "btl" (current value: "self,sm,openib,tcp", data source: environment)

so the timings below show many variations of restricting these values. Unfortunately, the variance is large for all combinations, but I find it notable that -mca btl self,openib is reliably much slower than self,tcp. Note that some nodes appear in multiple runs, yet there is no strict relationship where certain nodes are "fast": for instance, a6200 is very slow (6x and more) in the first set, then normal in the subsequent test. Nevertheless, when the same node appears in temporally nearby tests, there seems to be a correlation (though there is certainly not enough data here to establish that with confidence). As a final observation, I think the performance in all cases is unreasonably low, since the same test on a 2-socket Opteron 2356 (quad core, 2.3 GHz) unrelated to the cluster always takes between 9.75 and 10.0 seconds, about 30% faster than the fastest observations on the cluster nodes, despite those nodes having faster cores and memory.
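For completeness, the benchmark is essentially the following minimal program (a sketch: the buffer initialization and output format here are reconstructed, not copied verbatim from the actual source):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
    const int nlocal = 10000;   /* 10000 doubles = 80 kB contributed per process */
    const int niter  = 10000;   /* number of timed MPI_Allgather calls */
    int rank, size, i;
    double *localdata, *globaldata, tstart;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    localdata  = malloc(nlocal * sizeof(double));
    globaldata = malloc((size_t)size * nlocal * sizeof(double));
    for (i = 0; i < nlocal; i++) localdata[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
    tstart = MPI_Wtime();
    for (i = 0; i < niter; i++) {
      MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                    globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    }
    if (rank == 0)
      printf("%d MPI_Allgather calls: %.4e s\n", niter, MPI_Wtime() - tstart);

    free(localdata);
    free(globaldata);
    MPI_Finalize();
    return 0;
  }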
# JOB          TIME (s)      HOST

ompirun
  lsf.o240562  killed        8*a6200
  lsf.o240563  9.2110e+01    8*a6200
  lsf.o240564  1.5638e+01    8*a6237
  lsf.o240565  1.3873e+01    8*a6228

ompirun -mca btl self,sm
  lsf.o240574  1.6916e+01    8*a6237
  lsf.o240575  1.7456e+01    8*a6200
  lsf.o240576  1.4183e+01    8*a6161
  lsf.o240577  1.3254e+01    8*a6203
  lsf.o240578  1.8848e+01    8*a6274

prun (quadrics)
  lsf.o240602  1.6168e+01    4*a2108+4*a2109
  lsf.o240603  1.6746e+01    4*a2110+4*a2111
  lsf.o240604  1.6371e+01    4*a2108+4*a2109
  lsf.o240606  1.6867e+01    4*a2110+4*a2111

ompirun -mca btl self,openib
  lsf.o240776  3.1463e+01    8*a6203
  lsf.o240777  3.0418e+01    8*a6264
  lsf.o240778  3.1394e+01    8*a6203
  lsf.o240779  3.5111e+01    8*a6274

ompirun -mca self,sm,openib
  lsf.o240851  1.3848e+01    8*a6244
  lsf.o240852  1.7362e+01    8*a6237
  lsf.o240854  1.3266e+01    8*a6204
  lsf.o240855  1.3423e+01    8*a6276

ompirun
  lsf.o240858  1.4415e+01    8*a6244
  lsf.o240859  1.5092e+01    8*a6237
  lsf.o240860  1.3940e+01    8*a6204
  lsf.o240861  1.5521e+01    8*a6276
  lsf.o240903  1.3273e+01    8*a6234
  lsf.o240904  1.6700e+01    8*a6206
  lsf.o240905  1.4636e+01    8*a6269
  lsf.o240906  1.5056e+01    8*a6234

ompirun -mca self,tcp
  lsf.o240948  1.8504e+01    8*a6234
  lsf.o240949  1.9317e+01    8*a6207
  lsf.o240950  1.8964e+01    8*a6234
  lsf.o240951  2.0764e+01    8*a6207

ompirun -mca btl self,sm,openib
  lsf.o240998  1.3265e+01    8*a6269
  lsf.o240999  1.2884e+01    8*a6269
  lsf.o241000  1.3092e+01    8*a6234
  lsf.o241001  1.3044e+01    8*a6269

ompirun -mca btl self,openib
  lsf.o241013  3.1572e+01    8*a6229
  lsf.o241014  3.0552e+01    8*a6234
  lsf.o241015  3.1813e+01    8*a6229
  lsf.o241016  3.2514e+01    8*a6252

ompirun -mca btl self,sm
  lsf.o241044  1.3417e+01    8*a6234
  lsf.o241045  killed        8*a6232
  lsf.o241046  1.4626e+01    8*a6269
  lsf.o241047  1.5060e+01    8*a6253
  lsf.o241166  1.3179e+01    8*a6228
  lsf.o241167  2.7759e+01    8*a6232
  lsf.o241168  1.4224e+01    8*a6234
  lsf.o241169  1.4825e+01    8*a6228
  lsf.o241446  1.4896e+01    8*a6204
  lsf.o241447  1.4960e+01    8*a6228
  lsf.o241448  1.7622e+01    8*a6222
  lsf.o241449  1.5112e+01    8*a6204

ompirun -mca btl self,tcp
  lsf.o241556  1.9135e+01    8*a6204
  lsf.o241557  2.4365e+01    8*a6261
  lsf.o241558  4.2682e+01    8*a6214
  lsf.o241560  2.0481e+01    8*a6262

ompirun -mca btl self,sm,openib
  lsf.o241635  1.4234e+01    8*a6204
  lsf.o241636  1.2024e+01    8*a6214
  lsf.o241637  1.2773e+01    8*a6214
  lsf.o241638  killed        8*a6214
  lsf.o241684  1.8050e+01    8*a6261
  lsf.o241686  1.3567e+01    8*a6203
  lsf.o241687  1.5020e+01    8*a6228
  lsf.o241688  2.2387e+01    8*a6225

ompirun -mca btl self,openib
  lsf.o241723  3.0060e+01    8*a6228
  lsf.o241724  3.4366e+01    8*a6244
  lsf.o241725  3.0033e+01    8*a6203
  lsf.o241726  3.0499e+01    8*a6228
  lsf.o241741  3.0483e+01    8*a6234
  lsf.o241743  6.9527e+01    8*a6225
  lsf.o241744  3.0945e+01    8*a6244
  lsf.o241745  3.2120e+01    8*a6220

mvapich2 1.4.1
  lsf.o243902  1.3661e+01    8*a6243
  lsf.o244832  2.9471e+01    8*a6250
  lsf.o244833  2.8425e+01    8*a6250
  lsf.o244835  1.3644e+01    8*a6261
  lsf.o244837  1.3793e+01    8*a6244
  lsf.o244838  2.6907e+01    8*a6250
  lsf.o247496  1.3632e+01    8*a6244
  lsf.o247497  1.3368e+01    8*a6244
  lsf.o247499  1.4120e+01    8*a6252

Any suggestions?

Jed