On Wed, 2012-05-09 at 15:24 +0200, Eduardo Morras wrote:
> Sorry for the delay, and sorry again because in last mail i had the
> wrong taste that it was some kind of homework problem.

Don't worry ;). I simplified the core of the problem just to make it easier to understand (at least that was my intention xD). I wrote all the information that I found relevant in the opening post (320 CPUs, versions of OMPI, operating system, InfiniBand, etc.) precisely because I wanted to show that it wasn't homework or anything like that.

> At 17:41 04/05/2012, you wrote:
> > > The logic of send/recv looks ok. Now, in 5 and 7, recvSize(p2) and
> > > recvSize(p1) function what value returns?
> >All the sendSizes and RecvSizes are constant between iterations and are
> >calculated as a setup before all the calculations start.
>
> <snip>
>
> >Do you know what could cause the program to hang with the default value
> >(310) and to work fine with 305? I also tested it with 311 but it hanged
> >so it seems that it is not enough to activate the SEND flag.
>
> On your code, the only point where it could fail is if one of the
> precalculated message size values is wrongly calculated and executes
> the Recieve where it shouldn't.

Yes, but once the sizes are calculated they don't change, which is why I find it weird that it hangs the 30th time the whole communication loop is executed :S.

> From previous mails i understand that no if(ok!=MPI... line fires
> and there's no Sender waiting. The Ssend ends when the Recv starts to
> receive, not when the Recv ends the receive, so the sender may get an
> Ok but if there's an error Recv keeps the block. As you are using
> blocking communications, you can't do anything to prevent this, for
> example, check the Recv status while receiving.

I don't know how to check the Recv status, because the process remains blocked in the Recv function waiting for the message.

> Try to use Send instead Ssend (it should work but it could hang too)
> or change design to a non-blocking approach.

The problem is that it also hangs with non-blocking communications. The real program is coded with non-blocking communications, and it started to hang when the size of the mesh got bigger. I only switched to blocking communications to ease the debugging task.

Now it works, with both blocking and non-blocking communications, just by changing the value of the MCA parameter btl_openib_flags to 304 or 305 (the default value is 310). That means that the problem is with the RDMA protocols in InfiniBand for large messages. As far as I know, with those values the flags GET (4) and PUT (2) are deactivated, and the protocol for large messages remains the same as the one for small messages (send/receive).

To me, it seems that there is a bug (probably a memory leak) in OMPI or OFED.

Thanks for your help,
Jorge
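
For anyone who wants to check the flag arithmetic, here is a minimal sketch that decodes the low-order bits of the btl_openib_flags values tried in this thread. PUT = 2 and GET = 4 are the values given above; SEND = 1 is assumed from the earlier remark about 311 activating the SEND flag. The helper name decode_flags is made up for this illustration, and the remaining higher bits are only reported as a number, not interpreted.

```python
# Low-order bits of btl_openib_flags discussed in this thread.
# PUT = 2 and GET = 4 come from the message; SEND = 1 is an assumption here.
SEND, PUT, GET = 1, 2, 4


def decode_flags(value):
    """Return (names of the SEND/PUT/GET bits set in value, remaining higher bits)."""
    names = [name
             for name, bit in (("SEND", SEND), ("PUT", PUT), ("GET", GET))
             if value & bit]
    other = value & ~(SEND | PUT | GET)  # bits this sketch does not interpret
    return names, other


if __name__ == "__main__":
    # The four values mentioned in the thread.
    for value in (304, 305, 310, 311):
        names, other = decode_flags(value)
        print(value, names, "other bits:", other)
```

With this decoding, the two values that work (304 and 305) are exactly the ones with both PUT and GET cleared, while the two that hang (310 and 311) have both set; the SEND bit alone does not change the outcome.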