On Thu, Jun 4, 2009 at 2:54 PM, Lars Andersson <lars...@gmail.com> wrote:
> Hi Gus,
>
> Thanks for the suggestion. I've been thinking along those lines, but
> it seems to have drawbacks. Consider the following MPI conversation:
>
> Time NODE 1 NODE 2
> 0 local work local work
> 1 post n-b recv local work
> 2 local work post n-b send
> 3 complete recv in 1 local work

Sorry, that formatting didn't come out very well. Another attempt:

Time   NODE 1               NODE 2
0      local work           local work
1      post n-b recv        local work
2      local work           post n-b send
3      complete recv in 1   local work

Hopefully you get the idea...

/Lars

> In an ideal implementation, NODE 1 would be able to go back to local
> work immediately after posting a non-blocking receive at t=1.
>
> If using blocking message passing for the initial header, NODE 1 would
> have to block at least until t=2, when NODE 2 sends the corresponding
> message header. NODE 1 can then go on doing local work while the main
> message data is being transferred, but it still wastes one time unit
> waiting for the message header to arrive.
>
> Is there some clever way around this? Am I missing something?
>
> /Lars
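To make this concrete, the receive side of the header scheme I have in mind
looks roughly like the untested sketch below. The tag values, the header
being a single int, and the malloc'd buffer are just for illustration; the
sender would do a matching MPI_Send of the size followed by an MPI_Isend of
the payload. The MPI_Recv is where NODE 1 sits idle from t=1 to t=2:

#include <mpi.h>
#include <stdlib.h>

enum { TAG_HEADER = 100, TAG_PAYLOAD = 101 };   /* made-up tag values */

/* Blocking header exchange, then non-blocking payload receive. */
void post_variable_sized_recv(MPI_Request *payload_req, char **payload_buf)
{
    int payload_size = 0;
    MPI_Status st;

    /* NODE 1 blocks here until NODE 2 actually sends the header
       (t=1 to t=2 in the table above), i.e. the wasted time unit. */
    MPI_Recv(&payload_size, 1, MPI_INT, MPI_ANY_SOURCE, TAG_HEADER,
             MPI_COMM_WORLD, &st);

    /* The size is now known, so the payload receive can be non-blocking
       and overlap with local work. */
    *payload_buf = malloc(payload_size);
    MPI_Irecv(*payload_buf, payload_size, MPI_BYTE, st.MPI_SOURCE,
              TAG_PAYLOAD, MPI_COMM_WORLD, payload_req);
}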
> On Thu, Jun 4, 2009 at 2:34 PM, Lars Andersson <lars...@gmail.com> wrote:
>> Hi Lars
>>
>> I wonder if you could always use blocking message passing on the
>> preliminary send/receive pair that transmits the message size/header,
>> then use non-blocking mode for the actual message.
>> If the "message size/header" part transmits a small buffer,
>> the preliminary send/recv pair will use the "eager" communication mode,
>> return quickly, and may not reduce performance, I would guess.
>>
>> For a group of several messages, the preliminary
>> send/recv pair could transmit a small (to ensure "eager mode")
>> array of message sizes,
>> maybe along with the message tags and sender ranks,
>> instead of only one size.
>>
>> Just a thought.
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> Lars Andersson wrote:
>>> Hi,
>>>
>>> I'm trying to solve the problem of passing serializable, arbitrarily
>>> sized objects around using MPI and non-blocking communication. The
>>> problem I'm facing is what to do at the receiving end when expecting
>>> an object of unknown size, while at the same time not blocking to
>>> wait for it.
>>>
>>> When using blocking message passing, I have simply solved the problem
>>> by first sending a small, fixed-size header containing the size of the
>>> rest of the data, which is sent in the following MPI message. When using
>>> non-blocking message passing, this doesn't seem to be such a good
>>> idea, since we can't post the main data transfer until we have received
>>> the message header... It seems to take away most of the advantages of
>>> non-blocking I/O in the first place.
>>>
>>> I've been thinking about solving this using MPI_Probe / MPI_Iprobe,
>>> but I'm worried about performance.
>>>
>>> Question 1:
>>>
>>> Will MPI_Probe or the underlying MPI implementation actually receive
>>> the full message data (assuming a reasonably sized message, say less
>>> than 10MB) before MPI_Probe returns? Or will there be a significant
>>> data transfer delay (for large messages) when calling MPI_Recv after a
>>> successful MPI_Probe?
>>>
>>> What I want is something like this:
>>>
>>> 1) Post one or several non-blocking, variable-sized message receives.
>>>
>>> 2) Do other, non-MPI work, while any incoming messages are fully
>>> received into buffers on the local machine.
>>>
>>> 3) Complete the receives posted in 1). I don't want to wait here
>>> unnecessarily for data transfers that could have taken place during 2).
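As an aside, the closest I've come to this pattern so far is polling with
MPI_Iprobe between chunks of local work, roughly as in the untested sketch
below. more_local_work(), do_some_local_work() and handle_message() are
placeholders for our own code, and whether the payload has already arrived
locally by the time the probe succeeds is exactly what I'm unsure about:

#include <mpi.h>
#include <stdlib.h>

int  more_local_work(void);               /* placeholders, provided elsewhere */
void do_some_local_work(void);
void handle_message(char *buf, int nbytes);

void work_and_poll(void)
{
    while (more_local_work()) {
        do_some_local_work();                         /* step 2: non-MPI work */

        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            int nbytes = 0;
            MPI_Get_count(&st, MPI_BYTE, &nbytes);    /* size is now known */
            char *buf = malloc(nbytes);
            MPI_Recv(buf, nbytes, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            handle_message(buf, nbytes);
            free(buf);
        }
    }
}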
>>> Problems:
>>>
>>> I can't post non-blocking MPI_Irecv() calls in 1), because I don't know
>>> the sizes of the incoming messages.
>>>
>>> If I simply do nothing in 1) and call MPI_Probe in 3), I'm worried that
>>> I won't get nice compute/transfer overlap, because the messages won't
>>> actually be received locally until I post a Probe or Recv in 3).
>>>
>>> Question 2:
>>>
>>> How can I achieve the communication sequence described in 1), 2), 3)
>>> above, with overlapping data transfer and local computation during 2)?
>>>
>>> Question 3:
>>>
>>> A temporary kludge solution to the problem above might be to allocate
>>> a temporary receive buffer of some arbitrary, constant maximum size
>>> BUFSIZE in 1) for each non-blocking receive operation, make sure
>>> messages sent are never larger than BUFSIZE, and post MPI_Irecv(buffer,
>>> BUFSIZE, ...) calls in 1). I haven't been able to figure out whether it's
>>> actually correct and portable to receive less data than specified in
>>> the count argument to MPI_Irecv.
>>>
>>> What if the message sent on the other end is 10 bytes, and
>>> BUFSIZE = count = 20? Would that be OK?
>>>
>>> If anyone can shed any light on this, I'd be grateful. FYI, we're
>>> using a cluster of 2-8 core x86-64 machines running Linux and
>>> connected using ordinary 1Gbit Ethernet.
>>>
>>> Best regards,
>>>
>>> Lars Andersson
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
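PS: To be concrete about the kludge in question 3, I mean something like the
untested sketch below; BUFSIZE and the single outstanding request are
arbitrary. My understanding is that receiving a message shorter than the
posted count is legal, and that MPI_Get_count on the completed status reports
how many bytes actually arrived, but I'd be glad to have that confirmed.

#include <mpi.h>

#define BUFSIZE (1 << 20)   /* arbitrary upper bound; senders must not exceed it */

static char buf[BUFSIZE];

void kludge_receive(void)
{
    MPI_Request req;
    MPI_Status  st;
    int nbytes = 0;

    /* 1) post the receive without knowing the real message size */
    MPI_Irecv(buf, BUFSIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &req);

    /* 2) ...do local, non-MPI work here while the transfer proceeds... */

    /* 3) complete the receive; a 10-byte message into a larger buffer
          should simply complete with nbytes = 10 */
    MPI_Wait(&req, &st);
    MPI_Get_count(&st, MPI_BYTE, &nbytes);   /* actual number of bytes received */
    /* process(buf, nbytes); */
}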