On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> >> The attached code, is an example where openmpi/1.3.2 will lock up, if
> >> ran on 48 cores, of IB (4 cores per node),
> >> The code loops over recv from all processors on rank 0 and sends from
> >> all other ranks, as far as I know this should work, and I can't see
> >> why not.
> >> Note yes I know we can do the same thing with a gather, this is a
> >> simple case to demonstrate the issue.
> >> Note that if I increase the openib eager limit, the program runs,
> >> which normally means improper MPI, but I can't on my own figure out
> >> the problem with this code.
> >
> > What are you increasing the eager limit from and too?
>
> The same value as ethernet on our system,
> mpirun --mca btl_openib_eager_limit 655360 --mca
> btl_openib_max_send_size 655360 ./a.out
>
> Huge values compared to the defaults, but works,

My understanding of the code is that each message is 256k long, and the code pretty much guarantees that at some point there will be 46 messages in the queue ahead of the one you are trying to receive. That makes a total of 11.5MB, slightly less if you take shared memory into account. If MPI_SEND isn't blocking then each rank will send 50 messages to rank zero, and you'll have roughly 2000 messages and 500MB of data being received, with the message you want somewhere towards the end of the queue.

These numbers are far from huge, but compared to an eager limit of 64k they aren't small either. I suspect the eager limit is being reached on COMM_WORLD rank 0, and it is not pulling any more messages off the network until some of the existing ones have left the queue. They never will, because the message being waited for is one that is stuck on the network. As I say, the message queue for rank 0 when it's deadlocked would be interesting to look at.

In summary, this code makes heavy use of unexpected messages and network buffering, so it's not surprising to me that it only works with eager limits set fairly high.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
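
As a rough check of the arithmetic above, here is a minimal Python sketch. The 256k message size, the 46 queued messages, the 50 messages per sender and the 64k eager limit are taken from the discussion above. The count of 47 senders (48 ranks minus rank 0), the reading of "k" as KiB, and the helper name backlog_bytes are assumptions made for illustration only. With 47 senders the exact totals come out a little above the round figures quoted.

```python
KIB = 1024
MIB = 1024 * KIB


def backlog_bytes(num_messages, message_bytes):
    """Total payload held by num_messages queued messages (hypothetical helper)."""
    return num_messages * message_bytes


message_bytes = 256 * KIB  # per-message size quoted in the thread
eager_limit = 64 * KIB     # eager limit quoted in the thread
queued_ahead = 46          # messages ahead of the one rank 0 is waiting for
per_sender = 50            # messages each sending rank posts
senders = 48 - 1           # assumption: 48 ranks, every rank but 0 sends

# Payload sitting in front of the wanted message.
ahead_mib = backlog_bytes(queued_ahead, message_bytes) / MIB

# Worst case if sends do not block: every sender has posted everything.
total_messages = senders * per_sender
total_mib = backlog_bytes(total_messages, message_bytes) / MIB

# How far each message exceeds the eager limit.
size_ratio = message_bytes / eager_limit

print(f"queued ahead: {queued_ahead} messages, {ahead_mib} MiB")
print(f"worst case:   {total_messages} messages, {total_mib} MiB")
print(f"each message is {size_ratio:.0f}x the eager limit")
```

Changing senders, per_sender or message_bytes shows how quickly the amount of unexpected data grows relative to the eager limit.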