Hi George,

Thanks for the reply. I agree that the behaviour isn't outside the MPI
standard (perhaps I shouldn't have used "fault" in the title!). From a
usability perspective, my point is that it's undesirable for the application
to perform a hard stop when it appears it could safely proceed with a modest
code change that stalls until other operations complete. Is it worth
proposing a patch to that effect?
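
For concreteness, the change I have in mind is just a retry loop around the
allocation in mca_pml_ucx_bsend - a rough, untested sketch only; a real patch
would presumably need a bounded retry or an MCA parameter so it can't spin
forever in the deadlock case you describe:

    packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    while (NULL == packed_data) {
        /* alloc_buf already drives opal_progress() when it fails, so just
         * retry until completed sends release space in the attached buffer */
        packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    }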

Of course the application could be coded better in this area, but these
things are not always trivial - in this case it appears I need to allocate
more than 1GB of buffer space to execute reliably without a mod to the OMPI
source.
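
For reference, the application-side workaround amounts to attaching a much
larger buffer up front, sized from MPI_Pack_size plus MPI_BSEND_OVERHEAD per
outstanding send - roughly the sketch below, inside an already-initialized
MPI program (the message size and count here are placeholders, not my real
numbers):

    int pack_size, buf_size;
    int max_doubles_per_msg = 200000;  /* placeholder */
    int max_outstanding     = 64;      /* placeholder */
    MPI_Pack_size(max_doubles_per_msg, MPI_DOUBLE, MPI_COMM_WORLD, &pack_size);
    buf_size = max_outstanding * (pack_size + MPI_BSEND_OVERHEAD);
    void *bsend_buf = malloc(buf_size);
    MPI_Buffer_attach(bsend_buf, buf_size);
    /* ... buffered sends ... */
    MPI_Buffer_detach(&bsend_buf, &buf_size);  /* waits for pending bsends */
    free(bsend_buf);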

Martyn

On Tue, 17 Mar 2020 at 15:20, George Bosilca <bosi...@icl.utk.edu> wrote:

> Martyn,
>
> I don't know exactly what your code is doing, but based on your inquiry I
> assume you are using MPI_BSEND multiple times and you run out of local
> buffers.
>
> The MPI standard does not mandate a wait until buffer space becomes
> available, because that can lead to deadlocks (communication pattern
> depends on a local receive that will be posted after the bsend loop).
> Instead, the MPI standard states that it is the user's responsibility to
> ensure enough buffer space is available before calling MPI_BSEND (MPI3.2,
> page 39, line 36): "then MPI must buffer the outgoing message, so as to
> allow the send to complete. An error will occur if there is insufficient
> buffer space". For
> blocking buffered sends this is a gray area because from a user perspective
> it is difficult to know when you can safely reuse the buffer without
> implementing some kind of feedback mechanism to confirm the reception. For
> nonblocking the constraint is relaxed as indicated on page 55 line 33,
> "Successful return of MPI_WAIT after a MPI_IBSEND implies that the user
> buffer can be reused".
>
> In short, you should always make sure you have enough available buffer
> space for your buffered sends to be able to locally pack the data to be
> sent, or be ready to deal with the error returned by MPI (this part would
> not be portable across different MPI implementations).
>
>   George.
>
>
>
>
> On Tue, Mar 17, 2020 at 7:59 AM Martyn Foster via users <
> users@lists.open-mpi.org> wrote:
>
>> Hi all,
>>
>> I'm new here, so please be gentle :-)
>>
>> Versions: OpenMPI 4.0.3rc1, UCX 1.7
>>
>> I have a hang in an application (OK for small data sets, but fails with a
>> larger one). The error is
>>
>> "bsend: failed to allocate buffer"
>>
>> This comes from
>>
>> pml_ucx.c:693
>> mca_pml_ucx_bsend( ... )
>> ...
>>     packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
>>     if (OPAL_UNLIKELY(NULL == packed_data)) {
>>         OBJ_DESTRUCT(&opal_conv);
>>         PML_UCX_ERROR( "bsend: failed to allocate buffer");
>>         return UCS_STATUS_PTR(OMPI_ERROR);
>>     }
>>
>> In fact the request appears to be 1.3MB and the bsend buffer is (should
>> be!) 128MB
>>
>> In pml_base_bsend:332
>> void*  mca_pml_base_bsend_request_alloc_buf( size_t length )
>> {
>>    void* buf = NULL;
>>     /* has a buffer been provided */
>>     OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
>>     if(NULL == mca_pml_bsend_addr) {
>>         OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
>>         return NULL;
>>     }
>>
>>     /* allocate a buffer to hold packed message */
>>     buf = mca_pml_bsend_allocator->alc_alloc(
>>         mca_pml_bsend_allocator, length, 0);
>>     if(NULL == buf) {
>>         /* release resources when request is freed */
>>         OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
>>         /* progress communications, with the hope that more resources
>>          *   will be freed */
>>         opal_progress();
>>         return NULL;
>>     }
>>
>>     /* increment count of pending requests */
>>     mca_pml_bsend_count++;
>>     OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
>>
>>     return buf;
>> }
>>
>> It seems that there is a strong hint here that we can wait for the bsend
>> buffer to become available, and yet mca_pml_ucx_bsend doesn't have a retry
>> mechanism and just fails on the first attempt. A simple hack to turn the
>> "if(NULL == buf) {" into a "while(NULL == buf) {" in
>> mca_pml_base_bsend_request_alloc_buf seems to support this (the
>> application proceeds after a few milliseconds)...
>>
>> Is this hypothesis correct?
>>
>> Best regards, Martyn
>>