Hi Nathan, Joseph,

Thank you for your quick answers. I also noticed poor performance of MPI_Get when there are displacements in the datatype, not necessarily padding. So I'll keep in mind to declare the padding in my MPI_Datatype so that MPI is allowed to copy it, making the whole set of data contiguous and leaving a single RMA call under the hood.
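
For reference, here is roughly what I have in mind (a minimal sketch, assuming the same {double, int} struct as in my benchmark; the helper name is made up). Treating each element as sizeof(struct) raw bytes makes the padding part of the transferred data, so it is only valid when both sides use the same struct layout:

#include <mpi.h>

struct Item { double d; int i; };   // 12 bytes of data + 4 bytes of trailing padding

// Cover the full sizeof(Item) == 16 bytes with raw bytes, padding included,
// so MPI sees one contiguous block per element and can issue a single RMA
// transfer.
MPI_Datatype make_item_type()
{
    MPI_Datatype t;
    MPI_Type_contiguous(static_cast<int>(sizeof(Item)), MPI_BYTE, &t);
    MPI_Type_commit(&t);
    return t;
}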

Have a good day,

Antoine

On 2023-03-30 21:25, Nathan Hjelm via users wrote:

That is exactly the issue. Part of the reason I have argued against MPI_SHORT_INT usage in RMA is that, even though it is padded due to type alignment, we are still not allowed to operate on the bits between the short and the int. We can correct that one in the standard by adding the same language as C (padding bits are undefined), but when a user gives us their own datatype we have no options.

Yes, the best usage for the user is to keep the transfer completely contiguous. Otherwise osc/rdma will break it down, and with tcp that will be really horrible since each request becomes essentially a BTL active message.

-Nathan

On Mar 30, 2023, at 1:19 PM, Joseph Schuchart via users <users@lists.open-mpi.org> wrote:

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with gaps is that MPI is not allowed to touch the gaps. My guess is that for the RMA version of the benchmark the implementation either has to fall back to an active message that packs the data at the target and sends it back, or (which seems more likely in your case) transfer each object separately and skip the gaps. Without more information on your setup (using UCX?) and the benchmark itself (how many elements? what does the target do?) it's hard to be more precise.

A possible fix would be to drop the MPI datatype for the RMA case and transfer the vector as a whole, using MPI_BYTE. I think there is also a way to modify the upper bound of the MPI type to remove the gap, using MPI_TYPE_CREATE_RESIZED. I expect that this will allow MPI to touch the gap and transfer the vector as a whole. I'm not sure about the details there; maybe someone can shed some light.

HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:

Hello everyone,

I recently had to code an MPI application where I send std::vector contents in a distributed environment. In order to try different approaches I coded both 1-sided and 2-sided point-to-point communication schemes: the first one uses MPI_Win and MPI_Get, the second one uses MPI_Sendrecv.

I had a hard time figuring out why my implementation with MPI_Get was between 10 and 100 times slower, and I finally found out that MPI_Get is abnormally slow when one tries to send custom datatypes that include padding.

Attached is a short example where I send a struct {double, int} (12 bytes of data + 4 bytes of padding) vs. a struct {double, int, int} (16 bytes of data, 0 bytes of padding) with both MPI_Sendrecv and MPI_Get. I got these results:

mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s
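
For readers without the attachment, the padded datatype is declared roughly along these lines (a sketch only; the attached code may differ in the details):

#include <mpi.h>
#include <cstddef>

struct PadElem { double d; int i; };   // 12 bytes of data, 4 bytes of padding

// Build the {double, int} datatype and stretch its extent to sizeof(PadElem)
// so consecutive elements line up; the 4 padding bytes remain a gap that
// MPI is not allowed to touch.
MPI_Datatype make_pad_elem_type()
{
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(PadElem, d), offsetof(PadElem, i) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Datatype tmp, elem;
    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    MPI_Type_create_resized(tmp, 0, sizeof(PadElem), &elem);
    MPI_Type_commit(&elem);
    MPI_Type_free(&tmp);
    return elem;
}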

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got the same results.

Is this result normal? Is there any solution other than adding garbage at the end of the struct or at the end of the MPI_Datatype to avoid padding?

Regards,

Antoine Motte
