Let's back up and ask a question: is there a reason you're using Bsend? That is, do you actually need buffered sends for some reason, or could you use regular (potentially non-buffering) sends instead?
On Mar 18, 2020, at 5:16 AM, Martyn Foster via users <users@lists.open-mpi.org> wrote:

Hi George,

Thanks for the reply. I agree that the behaviour isn't outside the MPI standard (perhaps I shouldn't have used "fault" in the title!). From a utility perspective, though, my comment is that it's undesirable for the application to perform a hard stop when it appears it could safely proceed, with a modest code change, by stalling until other operations complete. Is it worth proposing a patch to that effect?

Of course the application could be coded better in this area, but these things are not always trivial - in this case it appears I need to allocate >1 GB to reliably execute without a modification to the OMPI source.

Martyn

On Tue, 17 Mar 2020 at 15:20, George Bosilca <bosi...@icl.utk.edu> wrote:

Martyn,

I don't know exactly what your code is doing, but based on your inquiry I assume you are calling MPI_BSEND multiple times and running out of local buffer space. The MPI standard does not mandate waiting until buffer space becomes available, because that can lead to deadlocks (the communication pattern may depend on a local receive that is only posted after the bsend loop). Instead, the standard states that it is the user's responsibility to ensure enough buffer space is available before calling MPI_BSEND; see MPI 3.2, page 39, line 36: "then MPI must buffer the outgoing message, so as to allow the send to complete. An error will occur if there is insufficient buffer space".

For blocking buffered sends this is a gray area, because from the user's perspective it is difficult to know when the buffer can safely be reused without implementing some kind of feedback mechanism to confirm reception. For nonblocking sends the constraint is relaxed, as indicated on page 55, line 33: "Successful return of MPI_WAIT after a MPI_IBSEND implies that the user buffer can be reused".
In short, you should always make sure you have enough buffer space available for your buffered sends to locally pack the data to be sent, or be prepared to deal with the error returned by MPI (that part would not be portable across different MPI implementations).

George.

On Tue, Mar 17, 2020 at 7:59 AM Martyn Foster via users <users@lists.open-mpi.org> wrote:

Hi all, I'm new here, so please be gentle :-)

Versions: OpenMPI 4.0.3rc1, UCX 1.7

I have a hang in an application (OK for small data sets, but it fails with a larger one). The error is "bsend: failed to allocate buffer". This comes from pml_ucx.c:693, in mca_pml_ucx_bsend():

    packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    if (OPAL_UNLIKELY(NULL == packed_data)) {
        OBJ_DESTRUCT(&opal_conv);
        PML_UCX_ERROR("bsend: failed to allocate buffer");
        return UCS_STATUS_PTR(OMPI_ERROR);
    }

In fact the request appears to be 1.3 MB and the bsend buffer is (should be!) 128 MB. In pml_base_bsend.c:332:

    void* mca_pml_base_bsend_request_alloc_buf(size_t length)
    {
        void* buf = NULL;
        /* has a buffer been provided */
        OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
        if (NULL == mca_pml_bsend_addr) {
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            return NULL;
        }

        /* allocate a buffer to hold packed message */
        buf = mca_pml_bsend_allocator->alc_alloc(
            mca_pml_bsend_allocator, length, 0);
        if (NULL == buf) {
            /* release resources when request is freed */
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            /* progress communications, with the hope that more resources
             * will be freed */
            opal_progress();
            return NULL;
        }

        /* increment count of pending requests */
        mca_pml_bsend_count++;
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
        return buf;
    }

It seems there is a strong hint here that we could wait for the bsend buffer to become available, and yet mca_pml_ucx_bsend() has no retry mechanism and simply fails on the first attempt.
A simple hack - turning the "if (NULL == buf) {" into a "while (NULL == buf) {" in mca_pml_base_bsend_request_alloc_buf() - seems to support this (the application proceeds after a few milliseconds)... Is this hypothesis correct?

Best regards, Martyn

--
Jeff Squyres
jsquy...@cisco.com