Jonathan --
Sorry for the delay in replying; thanks for posting again.
I'm actually unable to replicate your problem. :-( I have a new
Intel 8-core X5570 box; I'm running with np=6 and np=8 on both Open MPI
1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even
made your sample program worse -- I made a and b be 100,000 element
real arrays (increasing the count args in MPI_SENDRECV to 100,000 as
well), and increased nsteps to 150,000,000. No hangs. :-\
The version of the compiler *usually* isn't significant, so gcc 4.x
should be fine.
Yes, the sm flow control issue was a significant fix, but the blocking
nature of MPI_SENDRECV means that you shouldn't have run into the
problems that were fixed (the main issues had to do with fast senders
exhausting resources of slow receivers -- but MPI_SENDRECV is
synchronous so the senders should always be matching the speed of the
receivers).
Just for giggles, what happens if you change
    if (leftneighbour .eq. -1) then
        leftneighbour = nprocs-1
    endif
    if (rightneighbour .eq. nprocs) then
        rightneighbour = 0
    endif
to
    if (leftneighbour .eq. -1) then
        leftneighbour = MPI_PROC_NULL
    endif
    if (rightneighbour .eq. nprocs) then
        rightneighbour = MPI_PROC_NULL
    endif
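For context, here is a minimal sketch of where that change might sit in
a ring exchange. The original diffusion-mpi source is not shown in this
thread, so this is not the actual program: only leftneighbour,
rightneighbour, nprocs, a and b come from the discussion, while the
buffer length n, the tag value, and the single send-right/receive-left
exchange are assumptions made for illustration.

```fortran
program ring_proc_null
    ! Sketch only: names other than leftneighbour, rightneighbour,
    ! nprocs, a and b are assumed, not taken from the original code.
    implicit none
    include 'mpif.h'

    integer, parameter :: n = 10      ! assumed buffer length
    integer, parameter :: tag = 1     ! assumed message tag
    real :: a(n), b(n)
    integer :: rank, nprocs, leftneighbour, rightneighbour
    integer :: ierr, status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    leftneighbour  = rank - 1
    rightneighbour = rank + 1

    ! Non-periodic ends: the first and last ranks get no neighbour
    ! instead of wrapping around the ring.
    if (leftneighbour .eq. -1) then
        leftneighbour = MPI_PROC_NULL
    endif
    if (rightneighbour .eq. nprocs) then
        rightneighbour = MPI_PROC_NULL
    endif

    a = real(rank)
    b = -1.0

    ! Send a to the right, receive b from the left. A send to or a
    ! receive from MPI_PROC_NULL completes immediately and leaves the
    ! receive buffer untouched.
    call MPI_SENDRECV(a, n, MPI_REAL, rightneighbour, tag, &
                      b, n, MPI_REAL, leftneighbour,  tag, &
                      MPI_COMM_WORLD, status, ierr)

    print *, 'rank', rank, 'received b(1) =', b(1)

    call MPI_FINALIZE(ierr)
end program ring_proc_null
```

Note that this removes the wrap-around between the first and last
ranks, so it changes the boundary conditions of the problem; it is
meant as a diagnostic to see whether the hang depends on the ring being
closed, not as a fix.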
On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:
Continuing the conversation with myself:
Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations; there is no collective operation
anywhere in this sample code, but trying one of the suggested
workarounds/clues: that is, setting btl_sm_num_fifos to at least
(np-1) seems to make things work quite reliably, for both OpenMPI
1.3.2 and 1.3.3; that is, while this
    mpirun -np 6 -mca btl sm,self ./diffusion-mpi
invariably hangs (at random-seeming numbers of iterations) with
OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
seemingly randomly) with 1.3.3,
    mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
    mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much
faster...
Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>
--
Jeff Squyres
jsquy...@cisco.com