Gilles,

I rebuilt Open MPI with the attached patch, and it appears to have fixed the
issue I originally reported.



Clyde Stanfield
Software Engineer
734-480-5100 office
clyde.stanfi...@mdaus.com





From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Friday, July 06, 2018 11:16 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] MPI_Ialltoallv

Clyde,

thanks for reporting the issue.

Can you please give the attached patch a try?


Cheers,

Gilles

FWIW, the nbc module was not initially specific to Open MPI, and hence used
standard MPI subroutines.
In this case, we can avoid the issue by calling internal Open MPI subroutines.
This is an intermediate patch, since similar issues might occur in other places.


On Fri, Jul 6, 2018 at 11:12 PM Stanfield, Clyde
<clyde.stanfi...@radiantsolutions.com> wrote:
We are using MPI_Ialltoallv for an image processing algorithm. When doing this
we pass in a datatype built with MPI_Type_contiguous from MPI_C_FLOAT_COMPLEX,
which ends up covering multiple rows of the image (based on the number of nodes
used for distribution). In addition, sendcounts, sdispls, recvcounts, and
rdispls all fit within a signed int. Usually this works without any issues, but
when we lower our number of nodes we sometimes see failures.
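
For reference, a minimal sketch of the call pattern looks roughly like the
following (the image dimensions, the "row_block" name, and the count/displacement
layout are hypothetical, not our production code):

    #include <mpi.h>
    #include <complex.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* One element of "row_block" covers several image rows of complex floats. */
        const int num_columns    = 4096;   /* hypothetical image width      */
        const int rows_per_block = 8;      /* hypothetical rows per element */
        MPI_Datatype row_block;
        MPI_Type_contiguous(rows_per_block * num_columns, MPI_C_FLOAT_COMPLEX,
                            &row_block);
        MPI_Type_commit(&row_block);

        /* Counts and displacements are plain ints: one row_block per peer. */
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; ++i) { counts[i] = 1; displs[i] = i; }

        size_t elems = (size_t)nprocs * rows_per_block * num_columns;
        float complex *sendbuf = calloc(elems, sizeof(float complex));
        float complex *recvbuf = calloc(elems, sizeof(float complex));

        MPI_Request req;
        MPI_Ialltoallv(sendbuf, counts, displs, row_block,
                       recvbuf, counts, displs, row_block,
                       MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(sendbuf); free(recvbuf); free(counts); free(displs);
        MPI_Type_free(&row_block);
        MPI_Finalize();
        return 0;
    }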

What we found is that even though we can fit everything into signed ints, line
528 of nbc_internal.h ends up calling malloc with an int that appears to be the
size of (num_distributed_rows * num_columns * sizeof(std::complex<float>)),
which in very large cases wraps around to a negative value. As a result we end
up seeing "Error in malloc()" (line 530 of nbc_internal.h) throughout our
output.
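
To make the failure mode concrete, here is a standalone illustration of the
wrap-around we believe we are hitting (hypothetical dimensions; this is not the
actual nbc_internal.h code):

    #include <complex.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical dimensions, large enough that the byte count tops 2 GB. */
        int num_distributed_rows = 20000;
        int num_columns          = 16384;

        /* Correct size, computed in a wide type:
         * 20000 * 16384 * 8 = 2,621,440,000 bytes (~2.44 GiB). */
        long long true_bytes = (long long)num_distributed_rows * num_columns
                             * (long long)sizeof(float complex);

        /* What we believe happens inside the nbc code: the size is carried in
         * an int, so with a 32-bit int it ends up negative and malloc fails. */
        int as_int = (int)true_bytes;

        printf("true size: %lld bytes\n", true_bytes);
        printf("as an int: %d bytes\n", as_int);
        return 0;
    }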

We can get around this issue by ensuring the total size of our contiguous type
never exceeds 2 GB. However, this was unexpected, as our understanding was that
as long as we can fit all the parts into signed ints we should be able to
transfer more than 2 GB at a time. Is it intended that MPI_Ialltoallv requires
the underlying data to be less than 2 GB, or is this an error in how malloc is
being called (it should be called with a size_t instead of an int)?
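
For now our workaround is a guard along these lines (a sketch; the helper name
and dimensions are ours, not an MPI or Open MPI symbol): compute the total
payload with size_t arithmetic and keep shrinking the per-rank slab (or adding
nodes) until it fits in a signed int before building the contiguous type and
calling MPI_Ialltoallv as before.

    #include <complex.h>
    #include <limits.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Returns 1 if the total payload size still fits in a signed int. */
    static int fits_in_int(size_t num_rows, size_t num_columns)
    {
        size_t total_bytes = num_rows * num_columns * sizeof(float complex);
        return total_bytes <= (size_t)INT_MAX;
    }

    int main(void)
    {
        /* Hypothetical per-rank slab sizes. */
        size_t num_distributed_rows = 20000;
        size_t num_columns          = 16384;
        printf("fits in int: %s\n",
               fits_in_int(num_distributed_rows, num_columns) ? "yes" : "no");
        return 0;
    }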

Thanks,
Clyde Stanfield

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
