Hi Nathan, Joseph,

Thank you for your quick answers. I also noticed poor performance of MPI_Get when there are displacements in the datatype, not necessarily padding. So I'll keep in mind to declare the padding in my MPI_Datatype so that MPI is allowed to copy it, making the whole set of data contiguous and leaving a single RMA call under the hood.
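
For reference, here is roughly what I have in mind (a minimal sketch, assuming the same {double, int} struct as in my benchmark; the helper name is made up). Treating each element as sizeof(struct) raw bytes makes the padding part of the transferred data, so it is only valid when both sides use the same struct layout:

#include <mpi.h>

struct Item { double d; int i; };   // 12 bytes of data + 4 bytes of trailing padding

// Cover the full sizeof(Item) == 16 bytes with raw bytes, padding included,
// so MPI sees one contiguous block per element and can issue a single RMA
// transfer.
MPI_Datatype make_item_type()
{
    MPI_Datatype t;
    MPI_Type_contiguous(static_cast<int>(sizeof(Item)), MPI_BYTE, &t);
    MPI_Type_commit(&t);
    return t;
}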

Have a good day,

Antoine

On 2023-03-30 21:25, Nathan Hjelm via users wrote:

That is exactly the issue. Part of the reason I have argued against MPI_SHORT_INT usage in RMA is that, even though it is padded due to type alignment, we are still not allowed to operate on the bits between the short and the int. We can correct that one in the standard by adding the same language as C (padding bits are undefined), but when a user gives us their own datatype we have no options.

Yes, the best usage for the user is to keep the transfer completely contiguous. Otherwise osc/rdma will break it down, and with tcp that will be really horrible since each request becomes essentially a BTL active message.

-Nathan

On Mar 30, 2023, at 1:19 PM, Joseph Schuchart via users <users@lists.open-mpi.org> wrote:

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with gaps is that MPI is not allowed to touch the gaps. My guess is that for the RMA version of the benchmark the implementation either has to fall back to an active message that packs the data at the target and sends it back, or (which seems more likely in your case) transfer each object separately and skip the gaps. Without more information on your setup (using UCX?) and the benchmark itself (how many elements? what does the target do?) it's hard to be more precise.

A possible fix would be to drop the MPI datatype for the RMA case and transfer the vector as a whole, using MPI_BYTE. I think there is also a way to modify the upper bound of the MPI type to remove the gap, using MPI_TYPE_CREATE_RESIZED. I expect that this will allow MPI to touch the gap and transfer the vector as a whole. I'm not sure about the details there; maybe someone can shed some light.

HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:

Hello everyone,

I recently had to code an MPI application where I send std::vector contents in a distributed environment. In order to try different approaches I coded both 1-sided and 2-sided point-to-point communication schemes: the first one uses MPI_Win and MPI_Get, the second one uses MPI_Sendrecv.

I had a hard time figuring out why my implementation with MPI_Get was between 10 and 100 times slower, and I finally found out that MPI_Get is abnormally slow when one tries to send custom datatypes that include padding.

Attached is a short example where I send a struct {double, int} (12 bytes of data + 4 bytes of padding) vs. a struct {double, int, int} (16 bytes of data, 0 bytes of padding) with both MPI_Sendrecv and MPI_Get. I got these results:

mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s
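
For readers without the attachment, the padded datatype is declared roughly along these lines (a sketch only; the attached code may differ in the details):

#include <mpi.h>
#include <cstddef>

struct PadElem { double d; int i; };   // 12 bytes of data, 4 bytes of padding

// Build the {double, int} datatype and stretch its extent to sizeof(PadElem)
// so consecutive elements line up; the 4 padding bytes remain a gap that
// MPI is not allowed to touch.
MPI_Datatype make_pad_elem_type()
{
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(PadElem, d), offsetof(PadElem, i) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Datatype tmp, elem;
    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    MPI_Type_create_resized(tmp, 0, sizeof(PadElem), &elem);
    MPI_Type_commit(&elem);
    MPI_Type_free(&tmp);
    return elem;
}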

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got the same results.

Is this result normal? Is there any solution other than adding garbage at the end of the struct or at the end of the MPI_Datatype to avoid padding?

Regards,

Antoine Motte
