Hi, Eugene:
If it continues to be a problem for people to reproduce this, I'll see
what can be done about having an account made here for someone to poke
around. Alternatively, any suggestions for tests I could run to help
diagnose or verify the problem, or to figure out what's different about
this setup, would be greatly appreciated.
As for the btl_sm_num_fifos setting, it could be a bit of a red herring;
it's just something I started using after one of the previous bug
reports. However, it changes the behaviour quite markedly. With the
sample program I submitted (i.e., the Sendrecvs going around the ring),
and with OpenMPI 1.3.2 (the version where I see the most extreme
problems, e.g. things fail on every run), this always works:
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
and other, larger values of num_fifos also seem to work reliably, but 4
or fewer
mpirun -np 6 -mca btl_sm_num_fifos 4 -mca btl sm,self ./diffusion-mpi
always hangs, as before. After some number of iterations, sometimes
fewer, sometimes more, it is always stuck somewhere in the MPI_Sendrecv:
(gdb) where
#0 0x00002b9b0a661e80 in opal_progress@plt () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#1 0x00002b9b0a67e345 in ompi_request_default_wait () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2 0x00002b9b0a6a42c0 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#3 0x00002b9b0a43c540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#4 0x0000000000400eab in MAIN__ ()
#5 0x0000000000400fda in main (argc=1, argv=0x7fffb92cc078) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
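In case it helps anyone trying to reproduce this, the communication
pattern in diffusion-mpi boils down to a periodic ring of paired
Sendrecvs. Here's a stripped-down sketch of that pattern (names, array
sizes, and the iteration count are illustrative, not the exact code I
submitted):

program diffusion_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 100, niters = 100000
  double precision :: u(0:n+1)
  integer :: rank, nprocs, left, right, iter, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! periodic neighbours: rank 0's left neighbour is rank nprocs-1
  left  = rank - 1
  right = rank + 1
  if (left  < 0)       left  = nprocs - 1
  if (right >= nprocs) right = 0

  u = dble(rank)

  do iter = 1, niters
     ! send rightmost real cell right, fill left guard cell from the left
     call MPI_Sendrecv(u(n), 1, MPI_DOUBLE_PRECISION, right, 1, &
                       u(0), 1, MPI_DOUBLE_PRECISION, left,  1, &
                       MPI_COMM_WORLD, status, ierr)
     ! send leftmost real cell left, fill right guard cell from the right
     call MPI_Sendrecv(u(1),   1, MPI_DOUBLE_PRECISION, left,  2, &
                       u(n+1), 1, MPI_DOUBLE_PRECISION, right, 2, &
                       MPI_COMM_WORLD, status, ierr)
     ! ... local diffusion update on u(1:n) would go here ...
  end do

  call MPI_Finalize(ierr)
end program diffusion_sketch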
On the other hand, if I set the leftmost and rightmost neighbours to
MPI_PROC_NULL, as Jeff requested, the behaviour changes: any number of
fifos greater than two works:
mpirun -np 6 -mca btl_sm_num_fifos 3 -mca btl sm,self ./diffusion-mpi
But btl_sm_num_fifos 2 always hangs, either in the Sendrecv or in the
Finalize:
mpirun -np 6 -mca btl_sm_num_fifos 2 -mca btl sm,self ./diffusion-mpi
And the default always hangs, usually in the Finalize but sometimes in
the Sendrecv.
mpirun -np 6 -mca btl sm,self ./diffusion-mpi
(gdb) where
#0 0x00002ad54846d51f in poll () from /lib64/libc.so.6
#1 0x00002ad54717a7c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2 0x00002ad547179659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3 0x00002ad54716e189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4 0x00002ad54931ef15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5 0x00002ad546ca358b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6 0x00002ad546a5d529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7 0x0000000000400f99 in MAIN__ ()
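For reference, the non-periodic variant that Jeff asked about differs
only in the neighbour setup; in sketch form (again illustrative rather
than the exact submitted code):

  ! non-periodic neighbours: the two ends of the chain have no partner,
  ! so their boundary Sendrecvs go to/from MPI_PROC_NULL and complete
  ! immediately; the iteration loop itself is unchanged
  left  = rank - 1
  right = rank + 1
  if (rank == 0)          left  = MPI_PROC_NULL
  if (rank == nprocs - 1) right = MPI_PROC_NULL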
So, to summarize:

OpenMPI 1.3.2 + gcc 4.4.0
  Test problem with periodic Sendrecv()s (left neighbour of proc 0 is proc N-1):
    Default: always hangs in Sendrecv after a random number of iterations
    Turning off sm (-mca btl self,tcp): not observed to hang
    -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
    Fewer than 5 fifos: hangs in Sendrecv after a random number of iterations, or in Finalize

  Test problem with non-periodic Sendrecv()s (left neighbour of proc 0 is MPI_PROC_NULL):
    Default: always hangs, in Sendrecv after a random number of iterations or at Finalize
    Turning off sm (-mca btl self,tcp): not observed to hang
    -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
    Fewer than 5 but more than 2 fifos: not observed to hang
    2 fifos: hangs in Finalize, or in Sendrecv after a random number of iterations

OpenMPI 1.3.3 + gcc 4.4.0
  Test problem with periodic Sendrecv()s (left neighbour of proc 0 is proc N-1):
    Default: sometimes (~20% of the time) hangs in Sendrecv after a random number of iterations
    Turning off sm (-mca btl self,tcp): not observed to hang
    -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
    Fewer than 5 but more than 2 fifos: not observed to hang
    2 fifos: sometimes (~20% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes

  Test problem with non-periodic Sendrecv()s (left neighbour of proc 0 is MPI_PROC_NULL):
    Default: usually (~75% of the time) hangs, in Finalize or in Sendrecv after a random number of iterations
    Turning off sm (-mca btl self,tcp): not observed to hang
    -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang
    Fewer than 5 but more than 2 fifos: not observed to hang
    2 fifos: usually (~75% of the time) hangs in Finalize or in Sendrecv after a random number of iterations, but sometimes completes

OpenMPI 1.3.2 + Intel 11.0 compilers
  We are seeing a problem which we believe to be related: ~1% of certain
  single-node jobs hang, and turning off sm or setting btl_sm_num_fifos
  to NP-1 eliminates this.
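In case it's useful to others hitting this, the same workaround can also
be applied through the environment instead of on each mpirun line; for
our 6-task-per-node jobs that would be something like

export OMPI_MCA_btl_sm_num_fifos=5

(using Open MPI's OMPI_MCA_ environment-variable convention for MCA
parameters).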
- Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>