On Wed, 2012-05-09 at 15:24 +0200, Eduardo Morras wrote:
> Sorry for the delay, and sorry again because in my last mail I had the
> wrong impression that it was some kind of homework problem.
Don't worry ;). 
I simplified the core of the problem just to make it easier to
understand (at least that was my intention xD). And I wrote all the
information that I found relevant in the opening post (320 CPUs,
versions of OMPI, operating system and InfiniBand, etc.) precisely
because I wanted to show that it wasn't homework or anything like
that.
> 
> At 17:41 04/05/2012, you wrote:
> > > The logic of send/recv looks ok. Now, in 5 and 7, what value do
> > > recvSize(p2) and recvSize(p1) return?
> >All the sendSizes and RecvSizes are constant between iterations and are
> >calculated as a setup before all the calculations start.
> 
> <snip>
> 
> >Do you know what could cause the program to hang with the default value
> >(310) and to work fine with 305? I also tested it with 311 but it hung,
> >so it seems that it is not enough to activate the SEND flag.
> 
> In your code, the only point where it could fail is if one of the
> precalculated message size values is wrongly calculated and it executes
> the Receive where it shouldn't.
Yes, but after the sizes are calculated they don't change, and that's why
I find it weird that it hangs the 30th time the whole communication loop is
executed :S.
> 
> From previous mails I understand that no if(ok!=MPI... line fires
> and there's no Sender waiting. The Ssend ends when the Recv starts to
> receive, not when the Recv ends the receive, so the sender may get an
> OK, but if there's an error the Recv stays blocked. As you are using
> blocking communications, you can't do anything to prevent this, for
> example, check the Recv status while receiving.
I don't know how to check the Recv status because the processor remains
waiting for the message at the Recv function.
> Try to use Send instead of Ssend (it should work but it could hang too)
> or change the design to a non-blocking approach.

The problem is that it also hangs with non-blocking communications. The
real program is coded with non-blocking communications and it started to
hang when the size of the mesh got bigger. I just changed to blocking
communications to ease the debugging task.

Now it works, with blocking and non-blocking communications, just by
changing the value of the MCA parameter btl_openib_flags to 304 or 305
(the default value is 310). That suggests that the problem is with the RDMA
protocols in InfiniBand for large messages. As far as I know, with those
values the flags GET (4) and PUT (2) are deactivated and the protocol for
large messages remains the same as the one for small messages
(send/receive). To me, it seems that there is a bug (probably a memory
leak) in OMPI or OFED.
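
To make the flag arithmetic concrete, here is a minimal sketch in Python (not
part of the real program) that decodes the values tried in this thread. It
assumes only the three bit meanings mentioned above: SEND = 1, PUT = 2 and
GET = 4. The helper name decode_flags is made up for illustration, and the
remaining higher bits are reported as a plain number because their meaning is
not discussed here.

```python
# Bit values as discussed in this thread (assumption: SEND=1, PUT=2, GET=4).
SEND = 1
PUT = 2
GET = 4

KNOWN_BITS = (("SEND", SEND), ("PUT", PUT), ("GET", GET))


def decode_flags(value):
    """Return (names of the known bits that are set, remaining higher bits)."""
    names = [name for name, bit in KNOWN_BITS if value & bit]
    rest = value & ~(SEND | PUT | GET)
    return names, rest


if __name__ == "__main__":
    # The values that were tested: 310 (default), 311, 305 and 304.
    for value in (310, 311, 305, 304):
        names, rest = decode_flags(value)
        print(value, names, "other bits:", rest)
```

All four values share the same higher bits (304), so the only difference
between the hanging runs (310, 311) and the working ones (304, 305) is whether
PUT and GET are set. The value itself is passed on the command line, for
example `mpirun --mca btl_openib_flags 305 <program>`.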

Thanks for your help,
Jorge

