Hi, Jeff:
I wish I had your problems reproducing this. This problem apparently
rears its head when OpenMPI is compiled with the Intel compilers as
well, but only ~1% of the time. Unfortunately, we have users who
launch ~1400 single-node jobs at a go, so they see on the order of a
dozen or two jobs hang per suite of simulations when using the
defaults; the problem goes away when they use -mca btl self,tcp, or
when they use sm but set the number of fifos to np-1.
At first I had assumed it was a new-ish-architecture thing, as we
first saw the problem on the Nehalem Xeon E5540 nodes, but the sample
program hangs in exactly the same way on a Harpertown (E5430) machine
as well. So I've been assuming that this is a real problem that for
whatever reason is just exposed more with this particular version of
this particular compiler. I'd love to be wrong, and for the cause to
be something strange but easily changed in our environment.
Running with your suggested test change, e.g.
leftneighbour = rank-1
if (leftneighbour .eq. -1) then
!  leftneighbour = nprocs-1
   leftneighbour = MPI_PROC_NULL
endif
rightneighbour = rank+1
if (rightneighbour .eq. nprocs) then
!  rightneighbour = 0
   rightneighbour = MPI_PROC_NULL
endif
like so:
mpirun -np 6 -mca btl self,sm,tcp ./diffusion-mpi
I do seem to get different behaviour. With OpenMPI 1.3.2, the program
now frequently gets through all of its iterations, but then hangs at
the very end, which hadn't happened before -- attaching gdb to a
process shows it hanging in mpi_finalize:
(gdb) where
#0  0x00002b3635ecb51f in poll () from /lib64/libc.so.6
#1  0x00002b3634bd87c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x00002b3634bd7659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3  0x00002b3634bcc189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4  0x00002b3636d7cf15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5  0x00002b363470158b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6  0x00002b36344bb529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x0000000000400f99 in MAIN__ ()
#8  0x0000000000400fda in main (argc=1, argv=0x7fff3e3908c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
(gdb)
The rest of the time (maybe 1/4 of the time?) it hangs mid-run, in
the sendrecv:
(gdb) where
#0  0x00002b2bb44b4230 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#1  0x00002b2baf47d296 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2  0x00002b2baf215540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#3  0x0000000000400ea6 in MAIN__ ()
#4  0x0000000000400fda in main (argc=1, argv=0x7fff62d9b9c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
When running with OpenMPI 1.3.3, I get hangs significantly _more_
often with this change than before, again typically in the sendrecv:
#0  0x00002aeb89d6cf2b in mca_btl_sm_component_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_btl_sm.so
#1  0x00002aeb849bd14a in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x00002aeb8954f235 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#3  0x00002aeb84516586 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi.so.0
#4  0x00002aeb842ae5b0 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#5  0x0000000000400ea6 in MAIN__ ()
#6  0x0000000000400fda in main (argc=1, argv=0x7fff12a13068) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
but occasionally in the finalize again, and (unlike with 1.3.2) there
are occasional successful runs all the way through to completion.
Again, running the program with both versions of OpenMPI either
without sm:
mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
or with num_fifos=(np-1):
mpirun -np 6 -mca btl self,sm -mca btl_sm_num_fifos 5 ./diffusion-mpi
seems to work fine.
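For reference, the guts of diffusion-mpi are essentially a 1D ring
halo exchange done with MPI_SENDRECV inside the timestep loop. A
stripped-down sketch of that pattern is below; it's paraphrased rather
than the exact source, so the buffer names, nsteps value, and program
name are placeholders, though the neighbour setup matches the change
above:

program diffusion_ring
  ! Sketch of the exchange pattern: each rank swaps one boundary
  ! value with its left and right neighbours every step.
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, leftneighbour, rightneighbour
  integer :: step, nsteps
  integer :: status(MPI_STATUS_SIZE)
  double precision :: sendleft, sendright, recvleft, recvright

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Neighbour setup; with the suggested test change the end ranks
  ! talk to MPI_PROC_NULL instead of wrapping around the ring.
  leftneighbour = rank - 1
  if (leftneighbour .eq. -1) leftneighbour = MPI_PROC_NULL
  rightneighbour = rank + 1
  if (rightneighbour .eq. nprocs) rightneighbour = MPI_PROC_NULL

  nsteps = 100000          ! placeholder; the real program runs longer
  sendleft  = dble(rank)
  sendright = dble(rank)

  do step = 1, nsteps
     ! Send the left boundary value left, receive from the right...
     call MPI_SENDRECV(sendleft, 1, MPI_DOUBLE_PRECISION, leftneighbour, 1, &
                       recvright, 1, MPI_DOUBLE_PRECISION, rightneighbour, 1, &
                       MPI_COMM_WORLD, status, ierr)
     ! ...then send the right boundary value right, receive from the left.
     call MPI_SENDRECV(sendright, 1, MPI_DOUBLE_PRECISION, rightneighbour, 2, &
                       recvleft, 1, MPI_DOUBLE_PRECISION, leftneighbour, 2, &
                       MPI_COMM_WORLD, status, ierr)
  end do

  call MPI_FINALIZE(ierr)
end program diffusion_ring

These SENDRECV calls (and the final MPI_FINALIZE) are what show up in
the hung backtraces above.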
- Jonathan
On 2009-09-22, at 8:52PM, Jeff Squyres wrote:
Jonathan --
Sorry for the delay in replying; thanks for posting again.
I'm actually unable to replicate your problem. :-( I have a new
Intel 8-core X5570 box; I'm running at np=6 and np=8 on both Open MPI
1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even
made your sample program worse -- I made a and b 100,000-element
real arrays (increasing the count args in MPI_SENDRECV to 100,000 as
well) and increased nsteps to 150,000,000. No hangs. :-\
The version of the compiler *usually* isn't significant, so gcc 4.x
should be fine.
Yes, the sm flow control issue was a significant fix, but the
blocking nature of MPI_SENDRECV means that you shouldn't have run
into the problems that were fixed (the main issues had to do with
fast senders exhausting resources of slow receivers -- but
MPI_SENDRECV is synchronous so the senders should always be matching
the speed of the receivers).
Just for giggles, what happens if you change
if (leftneighbour .eq. -1) then
   leftneighbour = nprocs-1
endif
if (rightneighbour .eq. nprocs) then
   rightneighbour = 0
endif
to
if (leftneighbour .eq. -1) then
   leftneighbour = MPI_PROC_NULL
endif
if (rightneighbour .eq. nprocs) then
   rightneighbour = MPI_PROC_NULL
endif
On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:
Continuing the conversation with myself:
Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations. There is no collective operation
anywhere in this sample code, but trying one of the suggested
workarounds/clues -- setting btl_sm_num_fifos to at least (np-1) --
seems to make things work quite reliably for both OpenMPI 1.3.2 and
1.3.3. That is, while this
mpirun -np 6 -mca btl sm,self ./diffusion-mpi
invariably hangs (at random-seeming numbers of iterations) with
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
seemingly randomly) with 1.3.3,
mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much
faster...
Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>
--
Jeff Squyres
jsquy...@cisco.com
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>