Jonathan --

Sorry for the delay in replying; thanks for posting again.

I'm actually unable to replicate your problem. :-( I have a new Intel 8-core X5570 box; I've been running with -np 6 and -np 8 on both Open MPI 1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even made your sample program more demanding -- I made a and b 100,000-element real arrays (increasing the count args in MPI_SENDRECV to 100,000 as well) and increased nsteps to 150,000,000. No hangs. :-\
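For reference, here's roughly what my scaled-up test looks like (a minimal sketch of the ring exchange, not your exact code; the variable names and the MPI_REAL / MPI_COMM_WORLD arguments are my guesses at what your program uses):

      ! sketch of the scaled-up nearest-neighbour exchange test
      program diffring
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000, nsteps = 150000000
      real :: a(n), b(n)
      integer :: rank, nprocs, leftneighbour, rightneighbour
      integer :: ierr, step, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      leftneighbour  = rank - 1
      rightneighbour = rank + 1
      if (leftneighbour .eq. -1)      leftneighbour  = nprocs - 1
      if (rightneighbour .eq. nprocs) rightneighbour = 0

      a = real(rank)
      do step = 1, nsteps
         ! every rank sends right / receives from the left each iteration
         call MPI_SENDRECV(a, n, MPI_REAL, rightneighbour, 1,  &
                           b, n, MPI_REAL, leftneighbour,  1,  &
                           MPI_COMM_WORLD, status, ierr)
      enddo

      call MPI_FINALIZE(ierr)
      end program diffring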

The version of the compiler *usually* isn't significant, so gcc 4.x should be fine.

Yes, the sm flow control issue was a significant fix, but the blocking nature of MPI_SENDRECV means that you shouldn't have run into the problems that were fixed (the main issues had to do with fast senders exhausting the resources of slow receivers -- but MPI_SENDRECV doesn't return until its receive completes, so the senders should always be matching the speed of the receivers).
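To illustrate the distinction with a hypothetical fragment of my own (not anything from your code): the pre-fix resource exhaustion showed up with patterns like a burst of outstanding non-blocking sends that a slow receiver hasn't drained yet, whereas a SENDRECV loop can't get far ahead of its neighbours because each call waits for its receive to complete. Something like:

      ! hypothetical illustration of the two communication patterns
      program flowcontrast
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000, nmsgs = 64
      real :: burst(n, nmsgs), a(n), b(n)
      integer :: rank, nprocs, dest, src, i, ierr
      integer :: requests(nmsgs), statuses(MPI_STATUS_SIZE, nmsgs)
      integer :: status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      dest = mod(rank + 1, nprocs)
      src  = mod(rank - 1 + nprocs, nprocs)

      ! fast-sender pattern: many sends in flight before the receiver
      ! has drained the matching receives; this is the kind of traffic
      ! the sm flow control fix was about
      burst = real(rank)
      do i = 1, nmsgs
         call MPI_ISEND(burst(:, i), n, MPI_REAL, dest, i,  &
                        MPI_COMM_WORLD, requests(i), ierr)
      enddo
      do i = 1, nmsgs
         call MPI_RECV(b, n, MPI_REAL, src, i, MPI_COMM_WORLD,  &
                       status, ierr)
      enddo
      call MPI_WAITALL(nmsgs, requests, statuses, ierr)

      ! blocking exchange: MPI_SENDRECV does not return until its receive
      ! completes, so no rank can run far ahead of its neighbour
      a = real(rank)
      call MPI_SENDRECV(a, n, MPI_REAL, dest, 0,  &
                        b, n, MPI_REAL, src,  0,  &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
      end program flowcontrast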

Just for giggles, what happens if you change

      if (leftneighbour .eq. -1) then
         leftneighbour = nprocs-1
      endif
      if (rightneighbour .eq. nprocs) then
         rightneighbour = 0
      endif

to

      if (leftneighbour .eq. -1) then
         leftneighbour = MPI_PROC_NULL
      endif
      if (rightneighbour .eq. nprocs) then
         rightneighbour = MPI_PROC_NULL
      endif
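(Communication with MPI_PROC_NULL is a no-op: the send completes immediately and the receive returns right away without touching the buffer, so that change makes the end ranks drop out of the ring instead of wrapping around. If the hang disappears, that would point the finger at the wrap-around sm traffic between ranks 0 and nprocs-1.)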



On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:

Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations. There is no collective operation anywhere in this sample code, but I tried one of the suggested workarounds anyway: setting btl_sm_num_fifos to at least (np-1) seems to make things work quite reliably for both Open MPI 1.3.2 and 1.3.3. That is, while this

mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (after a random-seeming number of iterations) with Open MPI 1.3.2, and sometimes hangs (maybe 10% of the time, again seemingly at random) with 1.3.3,

mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much faster...
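
(For the record, the same MCA parameter can also be set outside the mpirun command line, e.g. via the environment with export OMPI_MCA_btl_sm_num_fifos=5, or in $HOME/.openmpi/mca-params.conf.)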

        Jonathan

--
Jonathan Dursi     <ljdu...@scinet.utoronto.ca>


--
Jeff Squyres
jsquy...@cisco.com
