On Mon, Apr 06, 2009 at 02:04:16PM -0700, Eugene Loh wrote:
> Steve Kargl wrote:
>
> > I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
> > One of my colleagues reported a dramatic drop in performance
> > with one of his applications.  My investigation shows a factor
> > of 10 drop in communication over the memory bus.  I've placed
> > a figure that illustrates the problem at
> >
> > http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg
> >
> > The legend in the figure has 'ver. 1.2.9 11 <--> 18'.  This
> > means communication between node 11 and node 18 over GigE
> > ethernet in my cluster.  'ver. 1.2.9 20 <--> 20' means
> > communication between processes on node 20, where node 20 has
> > 8 processors.  The image clearly shows
>
> Not so clearly in my mind, since I have trouble discriminating between
> the colors and the overlapping lines and so on.  But I'll take your
> word for it that the plot illustrates the point you are reporting.
OK.  I've removed the GigE results from the graph and plotted the data
with points as well as lines.  You'll see a red line by itself; the
green and blue lines overlap.  The replotted data is now at

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_new.jpg

> It appears that you used to have just better than 1-usec latency
> (which is reasonable), but then it skyrocketed just over 10x with
> 1.3.  I did some sm work, but that first appears in 1.3.2.

According to netpipe, I have

version 1.3.1
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.000009131
Sync Time: 0.000018241
Now starting main loop

version 1.2.9
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.000000669
Sync Time: 0.000001811

So, the latency has indeed gone up.

> The huge sm latencies are, so far as I know, inconsistent with
> everyone else's experience with 1.3.  Is there any chance you
> could rebuild all three versions and really confirm that the
> observed difference can actually be attributed to differences
> in the OMPI source code?  And/or run with "--mca btl self,sm"
> to make sure that the on-node message passing is indeed using sm?

The command lines I used are

/usr/local/openmpi-1.2.9/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.2.9/bin/mpiexec -machinefile mf_ompi_2 -n 2 ./z

/usr/local/openmpi-1.3.1/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.3.1/bin/mpiexec --mca btl self,sm -machinefile \
    mf_ompi_2 -n 2 ./z

There is no change in the results, as can be seen at

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_self.sm.jpg

The machinefile contains the single line 'node20.cimu.org slots=2'.

I can rebuild 1.2.9 and 1.3.1.  Are there any particular configure
options that I should enable/disable?

-- 
Steve