On 2010-04-20, at 9:18AM, Terry Dontje wrote:

> Hi Jonathan,
>
> Do you know what the top level function is or communication pattern? Is it
> some type of collective or a pattern that has a many to one.

Ah, should have mentioned. The best-characterized code that we're seeing this with is an absolutely standard (logically) regular grid hydrodynamics code, only does nearest neighbour communication for exchanging guardcells; the Wait in this case is, I think, just a matter of overlapping communication with computation of the inner zones. There are things like allreduces in there, as well, for setting timesteps, but the communication pattern is overall extremely regular and well-behaved.

> What might be happening is that since OMPI uses a lazy connections by default
> if all processes are trying to establish communications to the same process
> you might run into the below.
>
> You might want to see if setting "--mca mpi_preconnect_all 1" helps any. But
> beware this will cause your startup to increase. However, this might give us
> insight as to whether the problem is flooding a single rank with connect
> requests.

I'm certainly willing to try it.

- Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>