At 16:19 09/05/2012, you wrote:
> > In your code, the only point where it could fail is if one of the
> > precalculated message size values is wrongly calculated and executes
> > the Receive where it shouldn't.
>
> Yes, but the sizes don't change once they are calculated, and that's why I find it weird that it hangs the 30th time the whole communication loop is executed :S
If your code doesn't use sizeof with MPI datatypes, there should be no problem :)
> > From previous mails I understand that no if(ok!=MPI... line fires
> > and there's no sender waiting. The Ssend ends when the Recv starts to
> > receive, not when the Recv finishes receiving, so the sender may get an
> > OK, but if there's an error the Recv stays blocked. As you are using
> > blocking communications, you can't do anything to prevent this, for
> > example, check the Recv status while receiving.
>
> I don't know how to check the Recv status, because the processor remains waiting for the message inside the Recv function.
That's what I'm pointing out. In blocking mode you can't check that until the Recv ends.
> > Try to use Send instead of Ssend (it should work but it could hang too)
> > or change the design to a non-blocking approach.
>
> The problem is that it also hangs with non-blocking communications. The real program is coded with non-blocking communications, and it started to hang when the size of the mesh got bigger. I just changed to blocking communications to ease the debugging task.
>
> Now it works, with both blocking and non-blocking communications, just by changing the value of the MCA parameter btl_openib_flags to 304 or 305 (the default value is 310). That means that the problem is with the RDMA protocols in InfiniBand for large messages. As far as I know, with those values the flags GET (4) and PUT (2) are deactivated and the protocol for large messages remains the same as the one for small messages (send/receive). To me, it seems that there is a bug (probably a memory leak) in OMPI or OFED.
Some memory leaks that affect openib were fixed in 1.4.5; see the release notes.
> Thanks for your help,
> Jorge