[OMPI users] Fault in not recycling bsend buffer ?
Hi all,

I'm new here, so please be gentle :-)

Versions: OpenMPI 4.0.3rc1, UCX 1.7

I have a hang in an application (OK for small data sets, but it fails with a larger one). The error is

    "bsend: failed to allocate buffer"

This comes from pml_ucx.c:693:

    mca_pml_ucx_bsend( ... )
        ...
        packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
        if (OPAL_UNLIKELY(NULL == packed_data)) {
            OBJ_DESTRUCT(&opal_conv);
            PML_UCX_ERROR( "bsend: failed to allocate buffer");
            return UCS_STATUS_PTR(OMPI_ERROR);
        }

In fact the request appears to be 1.3MB and the bsend buffer is (or should be!) 128MB.

In pml_base_bsend:332:

    void* mca_pml_base_bsend_request_alloc_buf( size_t length )
    {
        void* buf = NULL;
        /* has a buffer been provided */
        OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
        if(NULL == mca_pml_bsend_addr) {
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            return NULL;
        }

        /* allocate a buffer to hold packed message */
        buf = mca_pml_bsend_allocator->alc_alloc(
            mca_pml_bsend_allocator, length, 0);
        if(NULL == buf) {
            /* release resources when request is freed */
            OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
            /* progress communications, with the hope that more resources
             * will be freed */
            opal_progress();
            return NULL;
        }

        /* increment count of pending requests */
        mca_pml_bsend_count++;
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);

        return buf;
    }

It seems that there is a strong hint here that we could wait for the bsend buffer to become available, and yet mca_pml_ucx_bsend doesn't have a retry mechanism and just fails on the first attempt. A simple hack to turn the "if(NULL == buf) {" into a "while(NULL == buf) {" in mca_pml_base_bsend_request_alloc_buf seems to support this (the application proceeds after a few milliseconds)...

Is this hypothesis correct?

Best regards, Martyn
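P.S. An equivalent way to express the retry at the caller, inside mca_pml_ucx_bsend(), rather than inside the allocator. This is just a sketch reusing the identifiers quoted above; the retry bound is an arbitrary number of my own, not anything from the Open MPI sources:

    int retries = 1000;  /* arbitrary bound so a missing/undersized buffer still fails */
    packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    while (NULL == packed_data && retries-- > 0) {
        /* alloc_buf() already called opal_progress() before returning NULL,
         * so pending bsends get a chance to drain and free allocator space;
         * just try the allocation again. */
        packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    }
    if (OPAL_UNLIKELY(NULL == packed_data)) {
        OBJ_DESTRUCT(&opal_conv);
        PML_UCX_ERROR( "bsend: failed to allocate buffer");
        return UCS_STATUS_PTR(OMPI_ERROR);
    }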
Re: [OMPI users] Fault in not recycling bsend buffer ?
Martyn,

I don't know exactly what your code is doing, but based on your inquiry I assume you are using MPI_BSEND multiple times and run out of local buffer space.

The MPI standard does not mandate waiting until buffer space becomes available, because that can lead to deadlocks (e.g. when the communication pattern depends on a local receive that will only be posted after the bsend loop). Instead, the MPI standard states that it is the user's responsibility to ensure enough buffer space is available before calling MPI_BSEND (MPI3.2, page 39, line 36): "then MPI must buffer the outgoing message, so as to allow the send to complete. An error will occur if there is insufficient buffer space". For blocking buffered sends this is a gray area, because from a user perspective it is difficult to know when the buffer can safely be reused without implementing some kind of feedback mechanism to confirm the reception. For nonblocking sends the constraint is relaxed, as indicated on page 55, line 33: "Successful return of MPI_WAIT after a MPI_IBSEND implies that the user buffer can be reused".

In short, you should always make sure you have enough available buffer space for your buffered sends to be able to locally pack the data to be sent, or be ready to deal with the error returned by MPI (this second part would not be portable across different MPI implementations).

George.
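To illustrate the sizing advice with a self-contained toy (this is not Martyn's code; the message count and sizes are invented for the example), the attached buffer should account for MPI_Pack_size plus MPI_BSEND_OVERHEAD for every bsend that can be outstanding at once:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical example: up to 100 outstanding bsends of ~1.3 MB each,
     * with the attached buffer sized to hold all of them. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) { MPI_Finalize(); return 0; }   /* needs 2 ranks */

        const int nmsgs = 100;
        const int count = 1300000 / sizeof(double);   /* ~1.3 MB per message */

        /* Per-message packed size, plus the bsend bookkeeping overhead. */
        int packed_size;
        MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &packed_size);
        int bufsize = nmsgs * (packed_size + MPI_BSEND_OVERHEAD);

        void *bsend_buf = malloc(bufsize);
        MPI_Buffer_attach(bsend_buf, bufsize);

        double *msg = calloc(count, sizeof(double));
        if (rank == 0) {
            for (int i = 0; i < nmsgs; i++)
                MPI_Bsend(msg, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            for (int i = 0; i < nmsgs; i++)
                MPI_Recv(msg, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        /* MPI_Buffer_detach blocks until all buffered messages have been
         * delivered, so the buffer can be freed safely afterwards. */
        void *detached; int detached_size;
        MPI_Buffer_detach(&detached, &detached_size);
        free(bsend_buf);
        free(msg);

        MPI_Finalize();
        return 0;
    }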
Re: [OMPI users] Limits of communicator size and number of parallel broadcast transmissions
On Mon, Mar 16, 2020 at 6:15 PM Konstantinos Konstantinidis via users <users@lists.open-mpi.org> wrote:

> Hi, I have some questions regarding technical details of MPI collective
> communication methods and broadcast:
>
> - I want to understand when the number of receivers in an MPI_Bcast can
>   be a problem slowing down the broadcast.

This is a pretty strong claim. Do you mind sharing with us the data that allowed you to reach such a conclusion?

> - There are a few implementations of MPI_Bcast. Consider that of a
>   binary tree. In this case, the sender (root) transmits the common
>   message to its two children and each of them to two more, and so on.
>   Is it accurate to say that in each level of the tree all transmissions
>   happen in parallel, or can only one transmission be done from each node?

Neither of those. Assuming the two children are on different nodes, one might not want to split the outgoing bandwidth of the parent between the two children, but instead order the transmissions. In this case one of the children is serviced first while the other one waits, and then the second is serviced while the first is waiting. So the binary tree communication pattern maximizes the outgoing bandwidth of some of the nodes, but not the overall bisectional bandwidth of the machine.

This is not necessarily true for hardware-level collectives. If the switches propagate the information, they might be able to push the data out to multiple other switches more or less in parallel.

> - To that end, is there a limit on the number of processes a process
>   can broadcast to in parallel?

No? I am not sure I understand the context of this question. I would say that as long as the collective communication is implemented over point-to-point communications, the limit is 1.

> - Since each MPI_Bcast is associated with a communicator, is there a
>   limit on the number of processes a communicator can have and, if so,
>   what is it in Open MPI?

No, there is no such limit in MPI, or in Open MPI.

George.

> Regards,
> Kostas
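To make the point-to-point argument concrete, here is a toy binary-tree broadcast sketch (this is not Open MPI's tuned MPI_Bcast; the tree layout and tag are made up for illustration). Each parent posts its two sends one after the other, so its outgoing link services one child at a time:

    #include <mpi.h>
    #include <stdio.h>

    /* Toy broadcast over a binary tree rooted at rank 0: rank r receives
     * from its parent (r-1)/2, then forwards to children 2r+1 and 2r+2.
     * Real implementations choose among several topologies (binomial,
     * pipelined, hardware-assisted, ...). */
    static void tree_bcast(void *buf, int count, MPI_Datatype type,
                           MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank != 0) {
            int parent = (rank - 1) / 2;
            MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
        }

        /* The two sends are issued sequentially: the second child is not
         * serviced until the first MPI_Send has returned, which is the
         * ordering described above. */
        int left  = 2 * rank + 1;
        int right = 2 * rank + 2;
        if (left  < size) MPI_Send(buf, count, type, left,  0, comm);
        if (right < size) MPI_Send(buf, count, type, right, 0, comm);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int value = 0, rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) value = 42;           /* payload to broadcast */
        tree_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d got %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }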