Jonathan Dursi wrote:
Continuing the conversation with myself:
Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations. There is no collective operation
anywhere in this sample code, but trying one of the suggested
workarounds/clues, namely setting btl_sm_num_fifos to at least
(np-1), seems to make things work quite reliably for both OpenMPI
1.3.2 and 1.3.3. That is, while this
mpirun -np 6 -mca btl sm,self ./diffusion-mpi
invariably hangs (at random-seeming numbers of iterations) with
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
seemingly randomly) with 1.3.3,
mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much
faster...
The btl_sm_num_fifos setting doesn't, on the surface, make much sense
to me. That presumably controls the number of receive FIFOs per process.
The default became 1, which could conceivably change behavior when
multiple senders all write to the same FIFO. But your sample program has
just one-to-one connections. Each receiver has only one sender. So,
the number of FIFOs shouldn't matter. Bumping the number up only means
you allocate some FIFOs that are never used.
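(For concreteness, here is the sort of exchange I have in mind. This is
only my own sketch of what I imagine diffusion-mpi's inner loop looks
like, a 1-D halo exchange with MPI_Sendrecv and non-periodic ends via
MPI_PROC_NULL; the variable names and details are mine, not Jonathan's
actual code.)

#include <mpi.h>

/* Sketch of a 1-D nearest-neighbour halo exchange: every iteration,
 * each rank trades one value with its left and right neighbours via
 * MPI_Sendrecv, so every receive is matched with exactly one send. */
int main(int argc, char **argv)
{
    int rank, np, iter, left, right;
    double left_ghost, right_ghost, left_edge = 1.0, right_edge = 2.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* non-periodic ends: traffic to MPI_PROC_NULL is a no-op */
    left  = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
    right = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;

    for (iter = 0; iter < 100000; iter++) {
        /* pass right edge to the right, receive left ghost from the left */
        MPI_Sendrecv(&right_edge, 1, MPI_DOUBLE, right, 0,
                     &left_ghost, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* pass left edge to the left, receive right ghost from the right */
        MPI_Sendrecv(&left_edge,   1, MPI_DOUBLE, left,  1,
                     &right_ghost, 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... update the local interior using the ghost values ... */
    }

    MPI_Finalize();
    return 0;
}

If the real loop looks anything like that, a single receive FIFO per
process really ought to be enough.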
Hmm. Continuing the conversation with myself, maybe that's not entirely
true. Whatever fragments a process sends must eventually be returned to
it by the receiver. So, a process receives not only messages from its
left but also returned fragments from its right. Still, why would np-1
FIFOs be needed? Why not just 2?
And, as Jeff points out, everyone should be staying in pretty good sync
with the Sendrecv pattern. So, how could there be a problem at all?
Like Jeff, I have so far come up empty trying to reproduce the problem
with the hardware/software conveniently accessible to me.