Hi, Eugene:

If it continues to be a problem for people to reproduce this, I'll see what can be done about having an account made here for someone to poke around. Alternatively, any suggestions for tests I can run to help diagnose or verify the problem, or to figure out what's different about this setup, would be greatly appreciated.

As for the btl_sm_num_fifos thing, it could be a bit of a red herring; it's just something I started using following one of the previous bug reports. However, it changes the behaviour pretty markedly. With the sample program I submitted (i.e., the Sendrecvs looping around), and with OpenMPI 1.3.2 (the version where I see the most extreme problems, e.g., things fail every run), this always works:

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

and other, larger values of num_fifos also seem to work reliably, but 4 or fewer

mpirun -np 6 -mca btl_sm_num_fifos 4 -mca btl sm,self ./diffusion-mpi

always hangs as before, after some number of iterations (sometimes fewer, sometimes more), and always somewhere in the MPI_Sendrecv:
(gdb) where
#0 0x00002b9b0a661e80 in opal_progress@plt () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#1 0x00002b9b0a67e345 in ompi_request_default_wait () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2 0x00002b9b0a6a42c0 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#3 0x00002b9b0a43c540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#4  0x0000000000400eab in MAIN__ ()
#5 0x0000000000400fda in main (argc=1, argv=0x7fffb92cc078) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
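
For reference, the communication pattern in the test is basically the following. This is just a minimal sketch of the pattern (the iteration count, buffer, and variable names here are made up for illustration), not the program I submitted:

program ring_sendrecv
  implicit none
  include 'mpif.h'
  integer :: rank, nprocs, left, right, ierr, iter
  integer :: status(MPI_STATUS_SIZE)
  double precision :: sendbuf, recvbuf

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! periodic neighbours: proc 0's left neighbour is proc nprocs-1
  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)

  sendbuf = dble(rank)
  do iter = 1, 100000
     ! the hang shows up in here after a random number of iterations
     call MPI_Sendrecv(sendbuf, 1, MPI_DOUBLE_PRECISION, right, 0, &
                       recvbuf, 1, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, status, ierr)
  end do

  call MPI_Finalize(ierr)
end program ring_sendrecv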

On the other hand, if I set the leftmost and rightmost neighbours to MPI_PROC_NULL as Jeff requested, the behaviour changes: any number of fifos greater than two works.
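
To be concrete, the only change from the periodic pattern sketched above is in the neighbour setup; roughly (again, just a sketch, using the same variable names as above):

  ! non-periodic variant: the end procs talk to MPI_PROC_NULL instead of wrapping around
  left  = rank - 1
  right = rank + 1
  if (rank == 0)          left  = MPI_PROC_NULL
  if (rank == nprocs - 1) right = MPI_PROC_NULL

With that change, for example, this always works: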

mpirun -np 6 -mca btl_sm_num_fifos 3 -mca btl sm,self ./diffusion-mpi

But btl_sm_num_fifos 2 always hangs, either in the Sendrecv or in the Finalize:

mpirun -np 6 -mca btl_sm_num_fifos 2 -mca btl sm,self ./diffusion-mpi

And the default always hangs, usually in the Finalize but sometimes in the Sendrecv:

mpirun -np 6 -mca btl sm,self ./diffusion-mpi
(gdb) where
#0  0x00002ad54846d51f in poll () from /lib64/libc.so.6
#1 0x00002ad54717a7c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2 0x00002ad547179659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3 0x00002ad54716e189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4 0x00002ad54931ef15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5 0x00002ad546ca358b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6 0x00002ad546a5d529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x0000000000400f99 in MAIN__ ()


So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
 Default always hangs in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos hangs in Sendrecv after random number of iterations or in Finalize

Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:
 Default always hangs, in Sendrecv after random number of iterations or at Finalize
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos hangs in Finalize or in Sendrecv after random number of iterations

OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
 Default sometimes (~20% of time) hangs in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos sometimes (~20% of time) hangs in Finalize or in Sendrecv after random number of iterations, but otherwise completes

Test problem with non-periodic (left neighbour of proc 0 is MPI_PROC_NULL) Sendrecv()s:
 Default usually (~75% of time) hangs, in Finalize or in Sendrecv after random number of iterations
 Turning off sm (-mca btl self,tcp) not observed to hang
 Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
 Using fewer than 5 fifos but more than 2 not observed to hang
 Using 2 fifos usually (~75% of time) hangs in Finalize or in Sendrecv after random number of iterations, but otherwise completes

OpenMPI 1.3.2 + intel 11.0 compilers

We are seeing a problem that we believe is related: ~1% of certain single-node jobs hang; turning off sm or setting btl_sm_num_fifos to NP-1 eliminates this.
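
For the record, the workaround we're applying there is of the form below; NP=8 and the executable name are just placeholders for illustration:

mpirun -np 8 -mca btl_sm_num_fifos 7 -mca btl sm,self ./application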

   - Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>



