Hi, Jeff:

I wish I had your problems reproducing this. The problem apparently rears its head when OpenMPI is compiled with the Intel compilers as well, but only ~1% of the time. Unfortunately, we have users who launch ~1400 single-node jobs at a go, so they see on the order of a dozen or two jobs hang per suite of simulations when using the defaults; the problem goes away when they use -mca btl self,tcp, or when they use sm but set the number of fifos to np-1.

At first I had assumed it was a new-ish-architecture thing, as we first saw the problem on the Nehalem Xeon E5540 nodes, but the sample program hangs in exactly the same way on a Harpertown (E5430) machine as well. So I've been assuming that this is a real problem that for whatever reason is just exposed more with this particular version of this particular compiler. I'd love to be wrong and for it to be something strange but easily changed in our environment that is causing this.

Running with your suggested test change, i.e.
       leftneighbour = rank-1
       if (leftneighbour .eq. -1) then
!          leftneighbour = nprocs-1
          leftneighbour = MPI_PROC_NULL
       endif
       rightneighbour = rank+1
       if (rightneighbour .eq. nprocs) then
!          rightneighbour = 0
          rightneighbour = MPI_PROC_NULL
       endif

like so:
mpirun -np 6 -mca btl self,sm,tcp ./diffusion-mpi
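
(For context, the communication in the main loop is just a pair of MPI_SENDRECVs per timestep. The following is a minimal, self-contained sketch of that pattern with the MPI_PROC_NULL change folded in; it's a reconstruction for illustration, not the actual diffusion-mpi source, so the program name, buffer sizes, and step count are made up.)

   program ringsketch
      use mpi
      implicit none
      integer :: rank, nprocs, ierr, step
      integer :: leftneighbour, rightneighbour
      integer :: status(MPI_STATUS_SIZE)
      double precision :: sendval, recvleft, recvright
      integer, parameter :: nsteps = 100000

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      ! non-periodic neighbours: the end ranks talk to MPI_PROC_NULL
      leftneighbour = rank-1
      if (leftneighbour .eq. -1) leftneighbour = MPI_PROC_NULL
      rightneighbour = rank+1
      if (rightneighbour .eq. nprocs) rightneighbour = MPI_PROC_NULL

      sendval = dble(rank)
      do step = 1, nsteps
         ! swap guardcells: send right / receive from the left, ...
         call MPI_SENDRECV(sendval, 1, MPI_DOUBLE_PRECISION, rightneighbour, 1, &
                           recvleft, 1, MPI_DOUBLE_PRECISION, leftneighbour, 1, &
                           MPI_COMM_WORLD, status, ierr)
         ! ... then send left / receive from the right
         call MPI_SENDRECV(sendval, 1, MPI_DOUBLE_PRECISION, leftneighbour, 2, &
                           recvright, 1, MPI_DOUBLE_PRECISION, rightneighbour, 2, &
                           MPI_COMM_WORLD, status, ierr)
      enddo

      call MPI_FINALIZE(ierr)
   end program ringsketch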

I do seem to get different behaviour. With OpenMPI 1.3.2, the program frequently runs to completion, but when it does so it hangs at the end, which hadn't happened before; attaching gdb to a process tells me that it's hanging in mpi_finalize:
(gdb) where
#0  0x00002b3635ecb51f in poll () from /lib64/libc.so.6
#1  0x00002b3634bd87c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x00002b3634bd7659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3  0x00002b3634bcc189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4  0x00002b3636d7cf15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5  0x00002b363470158b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6  0x00002b36344bb529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x0000000000400f99 in MAIN__ ()
#8 0x0000000000400fda in main (argc=1, argv=0x7fff3e3908c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
(gdb)

The rest of the time (maybe 1/4 of the time?) it hangs mid-run, in the sendrecv:
(gdb) where
#0  0x00002b2bb44b4230 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#1  0x00002b2baf47d296 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2  0x00002b2baf215540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#3  0x0000000000400ea6 in MAIN__ ()
#4 0x0000000000400fda in main (argc=1, argv=0x7fff62d9b9c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21


When running with OpenMPI 1.3.3, I get hangs significantly _more_ often with this change than before, typically in the sendrecv again:

#0  0x00002aeb89d6cf2b in mca_btl_sm_component_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_btl_sm.so
#1  0x00002aeb849bd14a in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x00002aeb8954f235 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#3  0x00002aeb84516586 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi.so.0
#4  0x00002aeb842ae5b0 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#5  0x0000000000400ea6 in MAIN__ ()
#6 0x0000000000400fda in main (argc=1, argv=0x7fff12a13068) at ../../../gcc-4.4.0/libgfortran/fmain.c:21

but occasionally in the finalize again, and (unlike with 1.3.2) there are occasional successful runs all the way through to completion.

Again, running the program with both versions of OpenMPI without sm,
mpirun -np 6 -mca btl self,tcp  ./diffusion-mpi

or with num_fifos=(np-1):
mpirun -np 6 -mca btl self,sm -mca btl_sm_num_fifos 5 ./diffusion-mpi

seems to work fine.
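
(In case it's useful to anyone else: as I understand it, we could also make that workaround the default for our users via an MCA parameters file rather than per-job mpirun flags; something like the following in ~/.openmpi/mca-params.conf, assuming the usual per-user params file location, and with the fifo count of 7 just being np-1 for our 8-core nodes.)

   # Per-user Open MPI MCA defaults (~/.openmpi/mca-params.conf)
   # Either avoid the sm btl entirely:
   btl = self,tcp
   # ...or, alternatively, keep sm but give it np-1 fifos
   # (e.g. 7 on an 8-core node):
   # btl_sm_num_fifos = 7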

        - Jonathan

On 2009-09-22, at 8:52PM, Jeff Squyres wrote:

Jonathan --

Sorry for the delay in replying; thanks for posting again.

I'm actually unable to replicate your problem. :-( I have a new Intel 8-core X5570 box; I'm running at np=6 and np=8 on both Open MPI 1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even made your sample program worse: I made a and b 100,000-element real arrays (increasing the count args in MPI_SENDRECV to 100,000 as well) and increased nsteps to 150,000,000. No hangs. :-\

The version of the compiler *usually* isn't significant, so gcc 4.x should be fine.

Yes, the sm flow control issue was a significant fix, but the blocking nature of MPI_SENDRECV means that you shouldn't have run into the problems that were fixed (the main issues had to do with fast senders exhausting resources of slow receivers -- but MPI_SENDRECV is synchronous so the senders should always be matching the speed of the receivers).
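
(Just to illustrate what I mean, and definitely not your code: the kind of pattern those fixes targeted looks more like the sketch below, where one side keeps posting small sends without ever having to wait for the other; names and counts here are made up for illustration.)

   program floodsketch
      ! Illustrative only: a fast-sender / slow-receiver pattern of the kind
      ! the 1.3.2 sm flow-control fixes were aimed at.  Run with np >= 2.
      use mpi
      implicit none
      integer :: rank, nprocs, ierr, i
      integer :: status(MPI_STATUS_SIZE)
      double precision :: val

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      if (nprocs .ge. 2) then
         if (rank .eq. 0) then
            ! small sends can complete eagerly, so rank 0 can run far
            ! ahead of rank 1 without blocking
            do i = 1, 1000000
               val = dble(i)
               call MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, 1, 0, &
                             MPI_COMM_WORLD, ierr)
            enddo
         else if (rank .eq. 1) then
            ! rank 1 posts matching receives; if it runs slower, the
            ! unreceived messages pile up on its side
            do i = 1, 1000000
               call MPI_RECV(val, 1, MPI_DOUBLE_PRECISION, 0, 0, &
                             MPI_COMM_WORLD, status, ierr)
            enddo
         endif
      endif

      call MPI_FINALIZE(ierr)
   end program floodsketch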

Just for giggles, what happens if you change

     if (leftneighbour .eq. -1) then
        leftneighbour = nprocs-1
     endif
     if (rightneighbour .eq. nprocs) then
        rightneighbour = 0
     endif

to

     if (leftneighbour .eq. -1) then
        leftneighbour = MPI_PROC_NULL
     endif
     if (rightneighbour .eq. nprocs) then
        rightneighbour = MPI_PROC_NULL
     endif



On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:

Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations. There is no collective operation anywhere in this sample code, but trying one of the suggested workarounds/clues (setting btl_sm_num_fifos to at least np-1) seems to make things work quite reliably for both OpenMPI 1.3.2 and 1.3.3. That is, while this

mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (at random-seeming numbers of iterations) with OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again seemingly randomly) with 1.3.3,

mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much faster...

        Jonathan

--
Jonathan Dursi     <ljdu...@scinet.utoronto.ca>


--
Jeff Squyres
jsquy...@cisco.com


--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>



