Jonathan Dursi wrote:

Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which describes deadlocks in looped collective operations. There is no collective operation anywhere in this sample code, but one of the suggested workarounds/clues, setting btl_sm_num_fifos to at least (np-1), seems to make things work quite reliably for both OpenMPI 1.3.2 and 1.3.3. That is, while this

mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (at random-seeming numbers of iterations) with OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again seemingly randomly) with 1.3.3,

mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much faster...
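(As an aside: if the workaround holds up, the setting doesn't have to live on the command line. If I have the MCA parameter file mechanism right, something like the following in $HOME/.openmpi/mca-params.conf should have the same effect; the value 5 here just mirrors the np-1 from the six-process runs above.)

btl = sm,self
btl_sm_num_fifos = 5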

The btl_sm_num_fifos suggestion doesn't, on the surface, make much sense to me. That parameter presumably controls the number of receive FIFOs per process. The default became 1, which could conceivably change behavior when multiple senders all push into the same FIFO. But your sample program has only one-to-one connections: each receiver has exactly one sender. So the number of FIFOs shouldn't matter; bumping the number up should only mean allocating some FIFOs that are never used.

Hmm. Continuing the conversation with myself, maybe that's not entirely true. Whatever fragments a process sends must eventually be returned to it by the receiver, so a process receives not only messages from its left but also returned fragments from its right. Still, why would np-1 FIFOs be needed? Why not just 2?

And, as Jeff points out, everyone should be staying in pretty good sync with the Sendrecv pattern. So, how could there be a problem at all?
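For concreteness, this is roughly the pattern I have in mind (a minimal sketch; I'm assuming the diffusion code does a plain left/right guardcell exchange with MPI_Sendrecv, and the array size, tags, and iteration count here are made up):

#include <mpi.h>

int main(int argc, char **argv)
{
    double u[102];              /* 100 interior cells plus two guardcells */
    int rank, size, left, right, i, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 102; i++) u[i] = (double)rank;

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (iter = 0; iter < 100000; iter++) {
        /* ship rightmost interior cell right, fill left guardcell from the left */
        MPI_Sendrecv(&u[100], 1, MPI_DOUBLE, right, 0,
                     &u[0],   1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ship leftmost interior cell left, fill right guardcell from the right */
        MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1,
                     &u[101], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... interior update would go here ... */
    }

    MPI_Finalize();
    return 0;
}

With MPI_PROC_NULL at the ends, every exchange pairs exactly one sender with one receiver, which is why I'd expect a single FIFO per process to be enough.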

Like Jeff, I have so far come up empty in my attempts to reproduce the problem (with the hardware/software conveniently accessible to me).
