The words 'eager', 'rendezvous', and 'credit' have a specific resonance only for implementors, and I think it's correct that the MPI specification sidesteps these words, since they are artifacts of implementation.
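For readers who are not implementors, here is a minimal sketch of what
those three terms usually mean in practice. Everything in it -- the
function names, the 8 KB threshold, the credit count -- is hypothetical,
and real implementations differ widely:

    /* Hypothetical eager/rendezvous selection with per-peer send
     * credits; stubs stand in for the real transport layer. */
    #include <stdio.h>
    #include <stddef.h>

    #define EAGER_LIMIT 8192      /* illustrative, not a real default */

    static int credits = 4;       /* known buffer slots at the peer */

    static void transport_put(const char *what, size_t len)
    {
        printf("%s: %zu bytes\n", what, len);   /* stub */
    }

    static void send_msg(const void *buf, size_t len)
    {
        (void)buf;
        if (len <= EAGER_LIMIT && credits > 0) {
            /* Eager: ship the payload now; the receiver must buffer
             * it if no matching receive has been posted yet. */
            credits--;
            transport_put("eager send", len);
        } else {
            /* Rendezvous: announce the message, wait for the receiver
             * to be ready, then move the payload. */
            transport_put("request-to-send", 0);
            /* ... receiver's clear-to-send would arrive here ... */
            transport_put("rendezvous payload", len);
        }
    }

    int main(void)
    {
        char buf[16384] = {0};
        send_msg(buf, 64);          /* small: eager while credits last */
        send_msg(buf, sizeof buf);  /* large: always rendezvous */
        return 0;
    }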
All implementations make their own guarantees and run into their own
different limitations. I'd expect a lot of them to blow up in various
areas if one were to start a 128k-processor job -- but that wouldn't
necessarily make them non-compliant. I agree that providing correctness
in the face of sleeping processes is a fine goal to strive for. However,
there are also arguments for why being too restrictive on credit
management can hurt, since this problem is a memory-buffering trade-off:
memory and buffering can be much cheaper than the extra bandwidth and
latency that may be incurred by being too restrictive with MPI-level
credit management. Using more buffering can also help loosely
synchronized jobs for a variety of reasons, stemming from either the way
the application is coded or the size and nature of the machine it's run
on.

christian

On Mon, 04 Feb 2008, Richard Treumann wrote:
>
> Is what George says accurate? If so, it sounds to me like OpenMPI does
> not comply with the MPI standard on the behavior of eager protocol.
> MPICH is getting dinged in this discussion because they have complied
> with the requirements of the MPI standard. IBM MPI also complies with
> the standard.
>
> If there is any debate about whether the MPI standard does (or should)
> require the behavior I describe below, then we should move the
> discussion to the MPI 2.1 Forum and get a clarification.
>
> To me, the MPI standard is clear that a program like this:
>
> task 0:
>     MPI_Init
>     sleep(3000);
>     start receiving messages
>
> each of tasks 1 to n-1:
>     MPI_Init
>     loop 5000 times
>         MPI_Send(small message to 0)
>     end loop
>
> may send some small messages eagerly if there is space at task 0, but
> must block each of tasks 1 to n-1 before allowing task 0 to run out of
> eager buffer space. Doing this requires a token or credit management
> system in which each task has credits for known buffer space at task
> 0. Each task will send eagerly to task 0 until the sender runs out of
> credits and then must switch to rendezvous protocol. Tasks 1 to n-1
> might each do 3 MPI_Sends or 300 MPI_Sends before blocking, depending
> on how much buffer space there is at task 0, but they would need to
> block in some MPI_Send before task 0 blows up.
>
> When task 0 wakes up and begins receiving the early arrivals, tasks 1
> to n-1 will unblock and resume looping. Allowing the user to shut off
> eager protocol by setting the eager size to 0 does not fix the
> standards-compliance issue. You must either have no eager protocol at
> all or must have an eager message token/credit strategy.
>
> Dick
>
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846  Fax (845) 433-8363
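Dick's pseudocode above translates directly into a small, self-contained
MPI program. A hedged rendering -- the 3000-second sleep and 5000
iterations are his; the int payload, the tag, and the drain loop on
rank 0 are arbitrary choices made for this sketch:

    /* Stress test sketched in the post above: rank 0 sleeps while all
     * other ranks flood it with small messages. A compliant MPI should
     * make the senders block once their eager credits are exhausted,
     * rather than let rank 0 be overrun. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, payload = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            sleep(3000);                     /* task 0 is "asleep" */
            /* Drain everything the other ranks sent. */
            for (i = 0; i < 5000 * (size - 1); i++)
                MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 drained all messages\n");
        } else {
            for (i = 0; i < 5000; i++)
                /* Per the argument above, some of these calls must
                 * block once rank 0 can no longer buffer arrivals. */
                MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }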
>
> users-boun...@open-mpi.org wrote on 02/03/2008 06:59:38 PM:
>
> > Well ... this is exactly the kind of behavior a high-performance
> > application tries to achieve, isn't it?
> >
> > The problem here is not the flow control. What you need is to avoid
> > buffering the messages on the receiver side. Luckily, Open MPI is
> > entirely configurable at runtime, so this situation is really easy
> > to deal with even at the user level. Set the eager size to zero, and
> > no buffering on the receiver side will be done. Your program will
> > survive as long as there is some available memory on the receiver.
> >
> >   Thanks,
> >     George.
> >
> > On Feb 1, 2008, at 6:32 PM, 8mj6tc...@sneakemail.com wrote:
> >
> > > That would make sense. I was able to break OpenMPI by having Node
> > > A wait for messages from Node B. Node B is in fact sleeping while
> > > Node C bombards Node A with a few thousand messages. After a while
> > > Node B wakes up and sends Node A the message it's been waiting on,
> > > but Node A has long since been buried and seg faults. If I
> > > decrease the number of messages C is sending, it works properly.
> > > This was on OpenMPI 1.2.4, using I think the SM BTL (it might have
> > > been MX or TCP, but certainly not InfiniBand; I could dig up the
> > > test and try again if anyone is seriously curious).
> > >
> > > Trying the same test on MPICH/MX went very, very slowly (I don't
> > > think they have any clever buffer management), but it didn't
> > > crash.
> > >
> > > Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
> > > |openmpi-users/Allow| wrote:
> > >> Hi,
> > >>
> > >> I am readying an openmpi 1.2.5 software stack for use with a
> > >> many-thousand-core cluster. I have a question about sending small
> > >> messages that I hope can be answered on this list.
> > >>
> > >> I was under the impression that if node A wants to send a small
> > >> MPI message to node B, it must have a credit to do so. The credit
> > >> assures A that B has enough buffer space to accept the message.
> > >> Credits are required by the MPI layer regardless of the BTL
> > >> transport layer used.
> > >>
> > >> I have been told by a Voltaire tech that this is not so: the
> > >> credits are used by the InfiniBand transport layer to reliably
> > >> send a message, and are not an Open MPI feature.
> > >>
> > >> Thanks,
> > >> Federico
> > >
> > > --
> > > --Kris
> > >
> > > [A dream that comes true can't really be called a dream.]

--
christian.b...@qlogic.com (QLogic Host Solutions Group, formerly Pathscale)
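George's "set the eager size to zero" advice corresponds to per-BTL MCA
parameters in Open MPI. A hedged example for the shared-memory BTL; the
parameter name is assumed from the Open MPI 1.2-era defaults and should
be verified with ompi_info before relying on it:

    # Force every message down the rendezvous path on the sm BTL.
    # Check the actual parameter name first:  ompi_info --param btl sm
    mpirun --mca btl_sm_eager_limit 0 -np 16 ./flood_test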