Gilles, I rebuilt Open MPI with the attached patch and can confirm that it appears to fix the issue I originally reported.
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Friday, July 06, 2018 11:16 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] MPI_Ialltoallv

Clyde,

Thanks for reporting the issue. Can you please give the attached patch a try?

Cheers,

Gilles

FWIW, the nbc module was not originally specific to Open MPI and hence used standard MPI subroutines. In this case we can avoid the issue by calling internal Open MPI subroutines instead. This is an intermediate patch, since similar issues might occur in other places.

On Fri, Jul 6, 2018 at 11:12 PM Stanfield, Clyde <clyde.stanfi...@radiantsolutions.com> wrote:

We are using MPI_Ialltoallv for an image processing algorithm. We pass in an MPI_Type_contiguous built from MPI_C_FLOAT_COMPLEX whose extent covers multiple rows of the image (based on the number of nodes used for distribution). In addition, sendcounts, sdispls, recvcounts, and rdispls all fit within signed ints.

Usually this works without any issues, but when we lower our number of nodes we sometimes see failures. What we found is that even though everything fits into signed ints, line 528 of nbc_internal.h ends up calling malloc with an int that appears to be (num_distributed_rows * num_columns * sizeof(std::complex<float>)), which in very large cases wraps around to a negative value. As a result we see "Error in malloc()" (line 530 of nbc_internal.h) throughout our output.

We can work around the issue by ensuring the total size of our contiguous type never exceeds 2 GB. However, this was unexpected, since our understanding was that as long as we can fit all the parts into signed ints we should be able to transfer more than 2 GB at a time. Is it intended that MPI_Ialltoallv requires the underlying data to be less than 2 GB, or is this an error in how malloc is being called (i.e. it should be called with a size_t instead of an int)?
Thanks,
Clyde Stanfield
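[Editor's note: the following is an illustrative sketch, not code from the original mail. The image dimensions are hypothetical and the actual MPI_Ialltoallv call is omitted; the sketch only shows how one element of a contiguous datatype built from MPI_C_FLOAT_COMPLEX can exceed INT_MAX bytes even though every count and displacement passed to MPI_Ialltoallv fits comfortably in a signed int, which is the condition under which an int-sized malloc argument wraps negative.]

    // Illustrative sketch only: hypothetical image dimensions.
    #include <mpi.h>
    #include <complex>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int nranks = 0, rank = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Hypothetical geometry: fewer nodes means more rows per block.
        const long long rows_per_block = 3072;     // rows exchanged per rank
        const long long num_columns    = 131072;   // complex samples per row

        // One datatype element spans several rows of std::complex<float>.
        // The element count (~4.0e8) still fits in the int that
        // MPI_Type_contiguous expects.
        MPI_Datatype block;
        MPI_Type_contiguous((int)(rows_per_block * num_columns),
                            MPI_C_FLOAT_COMPLEX, &block);
        MPI_Type_commit(&block);

        // The per-rank counts and displacements are tiny, so the
        // MPI_Ialltoallv arguments themselves are unproblematic
        // (the actual call is omitted to keep the sketch small).
        std::vector<int> sendcounts(nranks, 1), sdispls(nranks, 0);
        std::vector<int> recvcounts(nranks, 1), rdispls(nranks, 0);

        // ...but the byte size of a single element is what ends up being
        // passed to malloc() through an int.
        long long bytes = rows_per_block * num_columns
                        * (long long)sizeof(std::complex<float>);  // ~3 GiB
        int as_int = (int)bytes;   // wraps negative on common ABIs
        if (rank == 0)
            std::printf("element size = %lld bytes, as int = %d\n",
                        bytes, as_int);

        // malloc(as_int) then fails, producing the reported
        // "Error in malloc()"; computing the size as a size_t avoids
        // the wrap:
        //     void *buf = malloc((size_t)bytes);

        MPI_Type_free(&block);
        MPI_Finalize();
        return 0;
    }

The trailing comment mirrors the fix suggested in the report: computing the allocation size as a size_t rather than an int avoids the wrap regardless of how large the datatype extent is.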