Hi George

Sorry - this is not a valid MPI program.  It violates the requirement
that a program not depend on there being any system buffering.  See pages
32-33 of the MPI 1.1 standard.

Let's simplify to:
Task 0:
MPI_Recv( from 1 with tag 1)
MPI_Recv( from 1 with tag 0)

Task 1:
MPI_Send(to 0 with tag 0)
MPI_Send(to 0 with tag 1)
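
In compilable form (the int buffer, MPI_STATUS_IGNORE and the mpirun
invocation are my additions for illustration; run with two ranks, e.g.
mpirun -np 2 a.out) the pattern is:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Receives posted in the reverse of the send order. */
        MPI_Recv(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* tag 0 */
        MPI_Send(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);  /* tag 1 */
    }

    MPI_Finalize();
    return 0;
}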

Without any early arrival buffer (or with the eager size set to 0), task
0 will hang in the first MPI_Recv and never post a recv with tag 0.  Task
1 will hang in the MPI_Send with tag 0 because it cannot get past it
until the matching recv is posted by task 0.

If there is enough early arrival buffer for the first MPI_Send on task 1
to complete and the second MPI_Send to be posted, the example will run.
Once both sends are posted by task 1, task 0 will harvest the second send
and get out of its first recv. Task 0's second recv can then pick up the
message from the early arrival buffer, where it had to go to let task 1
complete send 1 and post send 2.

If an application wants to do this kind of order inversion, it should use
non-blocking operations.  For example, if task 0 posted an MPI_Irecv for
tag 1, then an MPI_Recv for tag 0, and lastly an MPI_Wait for the Irecv,
the example would be valid.
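
Sketched in code (the buffer names are mine), task 0's side becomes:

    MPI_Request req;
    int buf1 = 0, buf0 = 0;

    /* Post the tag-1 receive without blocking... */
    MPI_Irecv(&buf1, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
    /* ...so the tag-0 receive can match task 1's first send... */
    MPI_Recv(&buf0, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ...and then complete the tag-1 receive. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

Because the tag-1 receive is already posted, task 1's second send always
has a matching receive waiting and nothing depends on system buffering.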

I am not aware of any case where the standard allows a correct MPI
program to be deadlocked by an implementation limit.  A program can be
failed if it exceeds a limit, but I do not think it is ever OK to hang.

             Dick

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:

> Please allow me to slightly modify your example. It still follows the
> rules of the MPI standard, so I think it's a 100% standard-compliant
> parallel application.
>
> +------------------------------------------------------------+
> |                         task 0:                            |
> +------------------------------------------------------------+
> | MPI_Init()                                                 |
> | sleep(3000)                                                |
> | for( msg = 0; msg < 5000; msg++ ) {                        |
> |   for( peer = 0; peer < com_size; peer++ ) {               |
> |     MPI_Recv( ..., from = peer, tag = (5000 - msg),... );  |
> |   }                                                        |
> | }                                                          |
> +------------------------------------------------------------+
>
> +------------------------------------------------------------+
> |                   task 1 to com_size:                      |
> +------------------------------------------------------------+
> | MPI_Init()                                                 |
> | for( msg = 0; msg < 5000; msg++ ) {                        |
> |   MPI_Send( ..., 0, tag = msg, ... );                      |
> | }                                                          |
> +------------------------------------------------------------+
>
> Won't flow control stop this application from running to completion?
> It's easy to write an application that breaks a particular MPI
> implementation. That doesn't necessarily make the implementation
> non-standard-compliant.
>
> george.
>
> On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> >
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> > MPI_Send(small message to 0)
> > end loop
> >
> > May send some small messages eagerly if there is space at task 0 but
> > must block each of tasks 1 to n-1 before allowing task 0 to run out
> > of eager buffer space. Doing this requires a token or credit
> > management system in which each task has credits for known buffer
> > space at task 0. Each task will send eagerly to task 0 until the
> > sender runs out of credits and then must switch to rendezvous
> > protocol. Tasks 1 to n-1 might each do 3 MPI_Sends or 300 MPI_Sends
> > before blocking, depending on how much buffer space there is at task
> > 0, but they would need to block in some MPI_Send before task 0 blows
> > up.
> >
> > When task 0 wakes up and begins receiving the early arrivals, tasks
> > 1 to n-1 will unblock and resume looping. Allowing the user to shut
> > off eager protocol by setting the eager size to 0 does not fix the
> > standards compliance issue. You must either have no eager protocol
> > at all or have an eager message token/credit strategy.
> >
> > Dick
> >
> >
> >
> > users-boun...@open-mpi.org wrote on 02/03/2008 06:59:38 PM:
> >
> > > Well ... this is exactly the kind of behavior a high-performance
> > > application tries to achieve, isn't it?
> > >
> > > The problem here is not the flow control. What you need is to
> > > avoid buffering the messages on the receiver side. Luckily, Open
> > > MPI is entirely configurable at runtime, so this situation is
> > > really easy to deal with even at the user level. Set the eager
> > > size to zero, and no buffering will be done on the receiver side.
> > > Your program will survive as long as there is some memory
> > > available on the receiver.
> > >
> > >    Thanks,
> > >      George.
> > >
> > > On Feb 1, 2008, at 6:32 PM, 8mj6tc...@sneakemail.com wrote:
> > >
> > > > That would make sense. I was able to break OpenMPI by having
> > > > Node A wait for messages from Node B. Node B is in fact sleeping
> > > > while Node C bombards Node A with a few thousand messages. After
> > > > a while Node B wakes up and sends Node A the message it's been
> > > > waiting on, but Node A has long since been buried and segfaults.
> > > > If I decrease the number of messages C is sending, it works
> > > > properly. This was on OpenMPI 1.2.4, using I think the SM BTL
> > > > (might have been MX or TCP, but certainly not InfiniBand; I
> > > > could dig up the test and try again if anyone is seriously
> > > > curious).
> > > >
> > > > Trying the same test on MPICH/MX ran very, very slowly (I don't
> > > > think they have any clever buffer management), but it didn't
> > > > crash.
> > > >
> > > > Sacerdoti, Federico <Federico.Sacerdoti-at-deshaw.com> wrote:
> > > >> Hi,
> > > >>
> > > >> I am readying an openmpi 1.2.5 software stack for use with a
> > > >> many-thousand core cluster. I have a question about sending small
> > > >> messages that I hope can be answered on this list.
> > > >>
> > > >> I was under the impression that if node A wants to send a small
> > > >> MPI message to node B, it must have a credit to do so. The credit
> > > >> assures A that B has enough buffer space to accept the message.
> > > >> Credits are required by the MPI layer regardless of the BTL
> > > >> transport layer used.
> > > >>
> > > >> I have been told by a Voltaire tech that this is not so: the
> > > >> credits are used by the InfiniBand transport layer to reliably
> > > >> send a message, and are not an Open MPI feature.
> > > >>
> > > >> Thanks,
> > > >> Federico
> > > >>
> > > >
> > > >
> > > > --
> > > > --Kris
> > > >
> > > > 叶ってしまう夢は本当の夢と言えん。
> > > > [A dream that comes true can't really be called a dream.]
> > >
> >
>
