Let's back up and ask a question: is there a reason you're using Bsend? That is, do you actually need buffered sends for some reason, or could you use regular (potentially non-buffering) sends instead?
On Mar 18, 2020, at 5:16 AM, Martyn Foster via users <users@lists.open-mpi.org> wrote:

Hi George,

Thanks for the reply. I agree that the behaviour isn't outside the MPI standard (perhaps I shouldn't have used "fault" in the title!). From a utility perspective, though, my comment is that it's undesirable for the application to perform a hard stop when it appears it could safely proceed, with a modest code change, by stalling until other operations complete. Is it worth proposing a patch to that effect?

Of course the application could be coded better in this area, but these things are not always trivial - in this case it appears I need to allocate >1 GB to reliably execute without a modification to the OMPI source.

Martyn

On Tue, 17 Mar 2020 at 15:20, George Bosilca <bosi...@icl.utk.edu> wrote:

Martyn,

I don't know exactly what your code is doing, but based on your inquiry I assume you are calling MPI_BSEND multiple times and running out of local buffer space. The MPI standard does not mandate waiting until buffer space becomes available, because that can lead to deadlocks (the communication pattern may depend on a local receive that is only posted after the bsend loop). Instead, the standard states that it is the user's responsibility to ensure enough buffer space is available before calling MPI_BSEND; see MPI 3.2, page 39, line 36: "then MPI must buffer the outgoing message, so as to allow the send to complete. An error will occur if there is insufficient buffer space".

For blocking buffered sends this is a gray area, because from the user's perspective it is difficult to know when the buffer can safely be reused without implementing some kind of feedback mechanism to confirm reception. For nonblocking sends the constraint is relaxed, as indicated on page 55, line 33: "Successful return of MPI_WAIT after a MPI_IBSEND implies that the user buffer can be reused".
In short, you should always make sure you have enough buffer space available for your buffered sends to locally pack the data to be sent, or be prepared to deal with the error returned by MPI (that part would not be portable across different MPI implementations).

George.

On Tue, Mar 17, 2020 at 7:59 AM Martyn Foster via users <users@lists.open-mpi.org> wrote:

Hi all, I'm new here, so please be gentle :-)

Versions: OpenMPI 4.0.3rc1, UCX 1.7

I have a hang in an application (OK for small data sets, but it fails with a larger one). The error is "bsend: failed to allocate buffer". This comes from pml_ucx.c:693, in mca_pml_ucx_bsend():

    packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    if (OPAL_UNLIKELY(NULL == packed_data)) {
        OBJ_DESTRUCT(&opal_conv);
        PML_UCX_ERROR("bsend: failed to allocate buffer");
        return UCS_STATUS_PTR(OMPI_ERROR);
    }

In fact the request appears to be 1.3 MB and the bsend buffer is (should be!) 128 MB. In pml_base_bsend.c:332:

    void* mca_pml_base_bsend_request_alloc_buf(size_t length)
    {
        void* buf = NULL;
        /* has a buffer been provided */
        OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
        if (NULL == mca_pml_bsend_addr) {
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            return NULL;
        }

        /* allocate a buffer to hold packed message */
        buf = mca_pml_bsend_allocator->alc_alloc(
            mca_pml_bsend_allocator, length, 0);
        if (NULL == buf) {
            /* release resources when request is freed */
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            /* progress communications, with the hope that more resources
             * will be freed */
            opal_progress();
            return NULL;
        }

        /* increment count of pending requests */
        mca_pml_bsend_count++;
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
        return buf;
    }

It seems there is a strong hint here that we could wait for the bsend buffer to become available, and yet mca_pml_ucx_bsend() has no retry mechanism and simply fails on the first attempt.
A simple hack - turning the "if (NULL == buf) {" into a "while (NULL == buf) {" in mca_pml_base_bsend_request_alloc_buf() - seems to support this (the application proceeds after a few milliseconds)... Is this hypothesis correct?

Best regards, Martyn

--
Jeff Squyres
jsquy...@cisco.com