Jonathan Dursi wrote:
So to summarize:

OpenMPI 1.3.2 + gcc 4.4.0, test problem with periodic Sendrecv()s
(left neighbour of proc 0 is proc N-1):
  - Default: always hangs in Sendrecv after a random number of iterations.
  - Turning off sm (-mca btl self,tcp): not observed to hang.
  - Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
  - Using fewer than 5 fifos: hangs in Sendrecv or Finalize after a
    random number of iterations.

OpenMPI 1.3.3 + gcc 4.4.0, same test problem:
  - Default: sometimes (~20% of the time) hangs in Sendrecv after a
    random number of iterations.
  - Turning off sm (-mca btl self,tcp): not observed to hang.
  - Using -mca btl_sm_num_fifos 5 (for a 6-task job): not observed to hang.
  - Using fewer than 5 fifos but more than 2: not observed to hang.
  - Using 2 fifos: sometimes (~20% of the time) hangs in Finalize or in
    Sendrecv after a random number of iterations, but sometimes completes.

OpenMPI 1.3.2 + Intel 11.0 compilers:
  - We are seeing a problem we believe is related: ~1% of certain
    single-node jobs hang; turning off sm or setting num_fifos to NP-1
    eliminates this.
I can reproduce this with just Barriers, which keep the processes all
in sync. So this has nothing to do with processes outrunning one
another (which was unlikely in the first place, given that you had
Sendrecv calls).
The problem is fickle; e.g., building OMPI with -g seems to make it
go away.
I did observe that the sm FIFO would fill up. That's odd, since there
are never many in-flight messages. I tried adding a line of code that
makes a process pause whenever it tries to write to a FIFO that appears
full; that pretty much made the problem go away. So I suspect it's a
memory coherency problem: the receiver clears the FIFO, but the writer
still thinks it's congested.
I tried all sorts of GCC compilers; the problem seems to set in with
4.4.0. I don't know what's significant about that version. It requires
moving to the 2.18 assembler, but I tried the 2.18 assembler with 4.3.3
and that worked fine. I'd think this has to do with GCC 4.4.x, but you
say you see the problem with the Intel compilers as well. Hmm. Maybe
it's an OMPI problem that GCC 4.4.x merely exposes more readily?