Jonathan Dursi wrote:
Continuing the conversation with myself:
Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations. There is no collective operation
anywhere in this sample code, but trying one of the suggested
workarounds/clues, namely setting btl_sm_num_fifos to at least
(np-1), seems to make things work quite reliably for both OpenMPI
1.3.2 and 1.3.3. That is, while this
mpirun -np 6 -mca btl sm,self ./diffusion-mpi
invariably hangs (at random-seeming numbers of iterations) with
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
seemingly randomly) with 1.3.3,
mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much
faster...
The btl_sm_num_fifos setting doesn't, on the surface, make much sense
to me. That presumably controls the number of receive FIFOs per process.
The default became 1, which could conceivably change behavior when
multiple senders all write to the same FIFO. But your sample program has
just one-to-one connections. Each receiver has only one sender. So,
the number of FIFOs shouldn't matter. Bumping the number up only means
you allocate some FIFOs that are never used.
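(For concreteness, here is the sort of exchange I have in mind. This is
only my own sketch of what I imagine diffusion-mpi's inner loop looks
like, a 1-D halo exchange with MPI_Sendrecv and non-periodic ends via
MPI_PROC_NULL; the variable names and details are mine, not Jonathan's
actual code.)

#include <mpi.h>

/* Sketch of a 1-D nearest-neighbour halo exchange: every iteration,
 * each rank trades one value with its left and right neighbours via
 * MPI_Sendrecv, so every receive is matched with exactly one send. */
int main(int argc, char **argv)
{
    int rank, np, iter, left, right;
    double left_ghost, right_ghost, left_edge = 1.0, right_edge = 2.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* non-periodic ends: traffic to MPI_PROC_NULL is a no-op */
    left  = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
    right = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;

    for (iter = 0; iter < 100000; iter++) {
        /* pass right edge to the right, receive left ghost from the left */
        MPI_Sendrecv(&right_edge, 1, MPI_DOUBLE, right, 0,
                     &left_ghost, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* pass left edge to the left, receive right ghost from the right */
        MPI_Sendrecv(&left_edge,   1, MPI_DOUBLE, left,  1,
                     &right_ghost, 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... update the local interior using the ghost values ... */
    }

    MPI_Finalize();
    return 0;
}

If the real loop looks anything like that, a single receive FIFO per
process really ought to be enough.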
Hmm. Continuing the conversation with myself, maybe that's not entirely
true. Whatever fragments a process sends must eventually be returned to
it by the receiver. So, a process receives not only messages from its
left but also returned fragments from its right. Still, why would np-1
FIFOs be needed? Why not just 2?
And, as Jeff points out, everyone should be staying in pretty good sync
with the Sendrecv pattern. So, how could there be a problem at all?
Like Jeff, I have so far come up empty trying to reproduce the problem
with the hardware/software conveniently accessible to me.