[OMPI users] Fault in not recycling bsend buffer ?

2020-03-17 Thread Martyn Foster via users
Hi all,

I'm new here, so please be gentle :-)

Versions: OpenMPI 4.0.3rc1, UCX 1.7

I have a hang in an application (OK for small data sets, but fails with a
larger one). The error is

"bsend: failed to allocate buffer"

This comes from

pml_ucx.c:693
mca_pml_ucx_bsend( ... )
...
    packed_data = mca_pml_base_bsend_request_alloc_buf(packed_length);
    if (OPAL_UNLIKELY(NULL == packed_data)) {
        OBJ_DESTRUCT(&opal_conv);
        PML_UCX_ERROR( "bsend: failed to allocate buffer");
        return UCS_STATUS_PTR(OMPI_ERROR);
    }

In fact, the request appears to be 1.3MB, and the bsend buffer is (or should
be!) 128MB.

In pml_base_bsend.c:332:

void* mca_pml_base_bsend_request_alloc_buf( size_t length )
{
    void* buf = NULL;

    /* has a buffer been provided */
    OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
    if(NULL == mca_pml_bsend_addr) {
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
        return NULL;
    }

    /* allocate a buffer to hold packed message */
    buf = mca_pml_bsend_allocator->alc_alloc(
        mca_pml_bsend_allocator, length, 0);
    if(NULL == buf) {
        /* release resources when request is freed */
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
        /* progress communications, with the hope that more resources
         *   will be freed */
        opal_progress();
        return NULL;
    }

    /* increment count of pending requests */
    mca_pml_bsend_count++;
    OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);

    return buf;
}

It seems that there is a strong hint here that we can wait for the bsend
buffer to become available, and yet mca_pml_ucx_bsend doesn't have a retry
mechanism and just fails on the first attempt. A simple hack that turns the
"if(NULL == buf) {" into a "while(NULL == buf) {" in
mca_pml_base_bsend_request_alloc_buf seems to support this (the application
proceeds after a few milliseconds)...
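
Roughly, the retried allocation would look something like the sketch below
(an illustration of the idea only, not an exact diff; the unlock/progress
handling has to move inside the loop instead of returning NULL):

    buf = mca_pml_bsend_allocator->alc_alloc(
        mca_pml_bsend_allocator, length, 0);
    while(NULL == buf) {
        /* drop the lock and progress communications, hoping that pending
         * buffered sends complete and free space in the allocator */
        OPAL_THREAD_UNLOCK(&mca_pml_bsend_mutex);
        opal_progress();
        OPAL_THREAD_LOCK(&mca_pml_bsend_mutex);
        buf = mca_pml_bsend_allocator->alc_alloc(
            mca_pml_bsend_allocator, length, 0);
    }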

Is this hypothesis correct?

Best regards, Martyn


Re: [OMPI users] Fault in not recycling bsend buffer ?

2020-03-17 Thread George Bosilca via users
Martyn,

I don't know exactly what your code is doing, but based on your inquiry I
assume you are using MPI_BSEND multiple times and running out of local
buffer space.

The MPI standard does not mandate waiting until buffer space becomes
available, because that can lead to deadlocks (e.g., when the communication
pattern depends on a local receive that will only be posted after the bsend
loop). Instead, the MPI standard states that it is the user's responsibility
to ensure enough buffer space is available before calling MPI_BSEND; MPI3.2
page 39 line 36: "then MPI must buffer the outgoing message, so as to allow
the send to complete. An error will occur if there is insufficient buffer
space". For blocking buffered sends this is a gray area, because from a user
perspective it is difficult to know when the buffer can safely be reused
without implementing some kind of feedback mechanism to confirm reception.
For nonblocking sends the constraint is relaxed, as indicated on page 55
line 33: "Successful return of MPI_WAIT after a MPI_IBSEND implies that the
user buffer can be reused".

In short, you should always make sure you have enough available buffer
space for your buffered sends to be able to locally pack the data to be
sent, or be ready to deal with the error returned by MPI (this part would
not be portable across different MPI implementations).
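
As a rough illustration (the function name, message count, and sizes below
are made up for the example, and it assumes MPI_ERRORS_RETURN is set so that
MPI_Bsend actually returns an error code instead of aborting), sizing the
attached buffer from MPI_Pack_size plus MPI_BSEND_OVERHEAD and checking the
return code could look like this:

#include <mpi.h>
#include <stdlib.h>

/* Sketch: attach a buffer large enough for `nmsgs` buffered sends of
 * `count` ints each, and handle an allocation failure locally instead of
 * expecting MPI to wait for space.  nmsgs, count, dest are illustrative. */
int buffered_send_sketch(int nmsgs, int count, int dest, MPI_Comm comm)
{
    /* Make MPI_Bsend return an error code rather than aborting. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int pack_size;
    MPI_Pack_size(count, MPI_INT, comm, &pack_size);

    /* Each pending bsend needs its packed size plus MPI_BSEND_OVERHEAD. */
    int bufsize = nmsgs * (pack_size + MPI_BSEND_OVERHEAD);
    char *buffer = malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);

    int *payload = calloc(count, sizeof(int));
    for (int i = 0; i < nmsgs; i++) {
        int rc = MPI_Bsend(payload, count, MPI_INT, dest, i, comm);
        if (MPI_SUCCESS != rc) {
            /* Out of attached buffer space: drain, resize, or fail here. */
            break;
        }
    }

    /* Detach blocks until all buffered messages have been transmitted,
     * after which the buffer can be freed or reused. */
    int detached_size;
    MPI_Buffer_detach(&buffer, &detached_size);
    free(buffer);
    free(payload);
    return MPI_SUCCESS;
}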

  George.





Re: [OMPI users] Limits of communicator size and number of parallel broadcast transmissions

2020-03-17 Thread George Bosilca via users
On Mon, Mar 16, 2020 at 6:15 PM Konstantinos Konstantinidis via users <
users@lists.open-mpi.org> wrote:

> Hi, I have some questions regarding technical details of MPI collective
> communication methods and broadcast:
>
>    - I want to understand when the number of receivers in an MPI_Bcast can
>    be a problem slowing down the broadcast.
>
This is a pretty strong claim. Do you mind sharing with us the data that
allowed you to reach such a conclusion?

>
>    - There are a few implementations of MPI_Bcast. Consider that of a
>    binary tree. In this case, the sender (root) transmits the common
>    message to its two children, and each of them to two more, and so on.
>    Is it accurate to say that at each level of the tree all transmissions
>    happen in parallel, or can only one transmission be done from each node?
>
Neither of those. Assuming the 2 children are both on different nodes, one
might not want to split the outgoing bandwidth of the parent between the 2
children, but instead serialize them. In this case, one of the children is
serviced first while the other one waits, and only then is the second one
serviced. So, the binary tree communication pattern maximizes the outgoing
bandwidth of some of the nodes, but not the overall bisection bandwidth of
the machine.

This is not necessarily true for hardware-level collectives. If the switches
propagate the information, they might be able to push the data out to
multiple other switches more or less in parallel.
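
To make the serialization concrete, here is a rough sketch of a binary-tree
broadcast written over point-to-point calls (an illustration only, not Open
MPI's actual implementation): each parent posts one blocking send per child,
so its two children are serviced one after the other.

#include <mpi.h>

/* Sketch of a binary-tree broadcast over point-to-point operations.
 * At any moment a node has at most one outgoing send in flight. */
static void tree_bcast(void *buf, int count, MPI_Datatype dtype,
                       int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Re-number ranks so the root is virtual rank 0. */
    int vrank = (rank - root + size) % size;

    /* Receive from the parent first (except at the root). */
    if (vrank != 0) {
        int parent = ((vrank - 1) / 2 + root) % size;
        MPI_Recv(buf, count, dtype, parent, 0, comm, MPI_STATUS_IGNORE);
    }

    /* Then forward to the two children, one blocking send at a time. */
    for (int c = 1; c <= 2; c++) {
        int vchild = 2 * vrank + c;
        if (vchild < size) {
            int child = (vchild + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
    }
}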


>
>- To that end, is there a limit on the number of processes a process
>can broadcast to in parallel?
>
No? I am not sure I understand the context of this question. I would say
that as long as the collective communication is implemented over
point-to-point communications, the limit is 1.

>
>    - Since each MPI_Bcast is associated with a communicator, is there a
>    limit on the number of processes a communicator can have, and if so,
>    what is it in Open MPI?
>
No, there is no such limit in MPI, or in Open MPI.

  George.



>
> Regards,
> Kostas
>