Steve Kargl wrote:
I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
One of my colleagues reported a dramatic drop in performance
with one of his applications. My investigation shows a factor
of 10 drop in communication over the memory bus. I've placed
a figure that illustrates the problem at
http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg
The legend in the figure has 'ver. 1.2.9 11 <--> 18'. This
means communication between node 11 and node 18 over GigE
in my cluster. 'ver. 1.2.9 20 <--> 20' means
communication between processes on node 20 where node 20 has
8 processors. The image clearly shows

Not so clearly in my mind since I have trouble discriminating between
the colors and the overlapping lines and so on. But I'll take your word
for it that the plot illustrates the point you are reporting.
It appears that you used to have just better than 1-usec latency (which
is reasonable), but then it skyrocketed just over 10x with 1.3. I did
some sm work, but that first appears in 1.3.2. The huge sm latencies
are, so far as I know, inconsistent with everyone else's experience with
1.3. Is there any chance you could rebuild all three versions and
really confirm that the observed difference can actually be attributed
to differences in the OMPI source code? And/or run with
"--mca btl self,sm" to make sure that the on-node message passing is
indeed using sm?
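
To make that check concrete, here is a rough sketch of how the same on-node test could be repeated against each install with only the self and sm BTLs enabled. The prefixes are taken from the configure lines in the original post; the benchmark name ./pingpong is purely hypothetical, and because the builds are static it would need to be relinked against each version before its run. By default the script only prints the commands; setting RUN=1 executes them.

```shell
#!/bin/sh
# Sketch: build an mpirun command line restricted to the self and sm BTLs
# for each Open MPI install prefix, and print it (or run it when RUN=1).
# PREFIXES and BENCH are assumptions; adjust them to the local setup.
PREFIXES="${PREFIXES:-/usr/local/openmpi-1.2.9 /usr/local/openmpi-1.3.1}"
BENCH="${BENCH:-./pingpong}"   # hypothetical benchmark binary

sm_cmd() {
    # $1 = install prefix, $2 = number of processes, $3 = program to run
    echo "$1/bin/mpirun --mca btl self,sm -np $2 $3"
}

for p in $PREFIXES; do
    cmd=$(sm_cmd "$p" 2 "$BENCH")
    if [ "${RUN:-0}" = 1 ]; then
        $cmd            # execute the on-node run
    else
        echo "$cmd"     # dry run: show what would be executed
    fi
done
```

With the BTL list limited this way, a run in which sm cannot be used should fail outright rather than quietly use TCP, so a successful run is reasonable evidence that the measured latency really is the sm path.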

that communication over
GigE is consistent among the versions of OpenMPI. However, some
change in going from 1.2.9 to 1.3.x is causing a drop in
communication between processes on a single node.
Things to note: nodes 11, 18, and 20 are essentially idle
before and after a test. configure was run with the same set
of options, except that with 1.3 and 1.3.1 I needed to disable IPv6:

./configure --prefix=/usr/local/openmpi-1.2.9 \
    --enable-orterun-prefix-by-default --enable-static \
    --disable-shared

./configure --prefix=/usr/local/openmpi-1.3.1 \
    --enable-orterun-prefix-by-default --enable-static \
    --disable-shared --disable-ipv6

./configure --prefix=/usr/local/openmpi-1.3.1 \
    --enable-orterun-prefix-by-default --enable-static \
    --disable-shared --disable-ipv6

The operating system is FreeBSD 8.0. Nodes 18 and 20
are quad-core, dual-CPU Opteron-based systems, and node 11
is a dual-core, dual-CPU Opteron-based system. For additional
information, I've placed the output of ompi_info at
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.2.9
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.0
http://troutmask.apl.washington.edu/~kargl/ompi_info-1.3.1
Any hints on tuning 1.3.1 would be appreciated.