An application that relies on MPI eager buffers for correctness or performance is an incorrect application, if only because MPI implementations without eager support are perfectly legitimate. Moreover, such applications also miss the point on performance. The overheads are not only the memory MPI allocates to store the eager data, or the additional memcpy needed to move that data back into userland once the corresponding receive is posted, but also the stress on the unexpected-message path in the MPI library: potentially long chains of unexpected messages that must be traversed in order to guarantee the FIFO matching required by MPI.
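To make that concrete: preposting the receives keeps incoming data off the unexpected queue entirely, because every message finds its match the moment it arrives. A minimal sketch of the pattern follows; all names, sizes, and the exchange-with-everyone shape are purely illustrative, not taken from the application discussed below.

/* Sketch: prepost all receives, then send, then wait (illustrative only). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int COUNT = 1024;                 /* chunk size per peer (illustrative) */
    int npeers = size - 1;
    double *rbuf = malloc(npeers * COUNT * sizeof(double));
    double *sbuf = malloc(npeers * COUNT * sizeof(double));
    MPI_Request *reqs = malloc(npeers * sizeof(MPI_Request));
    for (int i = 0; i < npeers * COUNT; i++)
        sbuf[i] = (double)rank;             /* some payload to send */

    /* 1. Prepost every receive, so arriving messages match immediately
          instead of landing on the unexpected-message queue. */
    int r = 0;
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        MPI_Irecv(rbuf + r * COUNT, COUNT, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[r]);
        r++;
    }

    /* 2. Send; with the receives already posted this is correct even if the
          implementation has no eager protocol at all. */
    int s = 0;
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        MPI_Send(sbuf + s * COUNT, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        s++;
    }

    /* 3. Complete the receives. */
    MPI_Waitall(npeers, reqs, MPI_STATUSES_IGNORE);

    free(rbuf); free(sbuf); free(reqs);
    MPI_Finalize();
    return 0;
}

Because every receive exists before the matching send is issued, the code above does not depend on any eager buffering for either correctness or progress.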
Along the same lines as Jeff: if you want a portable and efficient MPI application, assume the eager size is always 0 and prepost all your receives.

George.

PS: In OMPI the eager size is provided by the underlying transport (the BTL) and can be changed via MCA parameters. 'ompi_info --param btl all -l 4 | grep eager' should give you the full list.

On Thu, Mar 26, 2020 at 10:00 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> On Mar 26, 2020, at 5:36 AM, Raut, S Biplab <biplab.r...@amd.com> wrote:
> >
> > I am doing pairwise send-recv and not all-to-all, since not all the data is required by all the ranks.
> > And I am doing blocking send and recv calls, since there are multiple iterations of such message chunks to be sent with synchronization.
> >
> > I understand your recommendation in the mail below; however, I still see a benefit for my application-level algorithm in doing pairwise send-recv chunks where each chunk is within the eager limit.
> > Since the input and output buffer are the same within the process, I can avoid certain buffering at each sender rank by doing successive send calls within the eager limit to the receiver ranks and then issuing the recv calls.
>
> But if the buffers are small enough to fall within the eager limit, there's very little benefit to not having an A/B buffering scheme. Sure, it's 2x the memory, but it's 2 times a small number (measured in KB). Assuming you have GB of RAM, it's hard to believe that this would make a meaningful difference. Indeed, one way to think of the eager limit is: "it's small enough that the cost of a memcpy doesn't matter."
>
> I'm not sure I understand your comments about preventing copying. MPI will always do the most efficient thing to send the message, regardless of whether it is under the eager limit or not. I also don't quite grok your comments about "application buffering" and the message buffering required by the eager protocol.
>
> The short version of this is: you shouldn't worry about any of this. Rely on the underlying MPI to do the most efficient thing possible, and use a communication algorithm that makes sense for your application. In most cases, you'll be good.
>
> If you start trying to tune for a specific environment, platform, and MPI implementation, the number of variables grows exponentially. And if you change any one parameter in the whole setup, your optimizations may get lost. Also, if you add a bunch of infrastructure in your app to try to exactly match your environment+platform+implementation (e.g., manual segmenting to fit your overall message into the eager limit), you may just be adding additional overhead that effectively nullifies any optimization you might get (especially if the optimization is very small). Indeed, the methods used for shared memory are similar to, but different from, the methods used for networks. And there's a wide variety of network capabilities; some can be more efficient than others (depending on a zillion factors).
>
> If you're using shared memory, ensure that your Linux kernel has good shared memory support (e.g., support for CMA), and let MPI optimize the message transfers for you.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
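As a footnote on the A/B buffering Jeff mentions above: the idea is simply two small staging buffers, so the send never has to reuse memory the receive is still writing into. A hedged sketch, with illustrative names and a fixed chunk limit that are not from the original application:

#include <mpi.h>
#include <string.h>

#define MAX_CHUNK 1024   /* illustrative upper bound on the per-exchange chunk */

/* One pairwise exchange step: stage the outgoing chunk in buffer A, receive
 * into buffer B, then copy back. Costs two memcpys of a few KB and is correct
 * for any message size, eager or rendezvous. */
void exchange_chunk(double *data, int chunk, int peer, MPI_Comm comm)
{
    double a_buf[MAX_CHUNK], b_buf[MAX_CHUNK];            /* the "A" and "B" buffers */
    memcpy(a_buf, data, chunk * sizeof(double));          /* stage outgoing data */
    MPI_Sendrecv(a_buf, chunk, MPI_DOUBLE, peer, 0,
                 b_buf, chunk, MPI_DOUBLE, peer, 0,
                 comm, MPI_STATUS_IGNORE);
    memcpy(data, b_buf, chunk * sizeof(double));          /* unpack incoming data */
}

With the separate receive buffer, the in-place reuse of the application buffer no longer depends on the sends completing eagerly.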