Ralph -- This change was not correct (https://github.com/open-mpi/ompi/commit/ce915b5757d428d3e914dcef50bd4b2636561bca). It is causing memory corruption in the openib BTL.
> On May 25, 2015, at 11:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> I don’t see a problem with it. FWIW: I’m getting ready to release 1.8.6 in
> the next week
>
>> On May 25, 2015, at 8:46 AM, Xavier Besseron <xavier.besse...@uni.lu> wrote:
>>
>> Good that it will be fixed in the next release!
>>
>> In the meantime, and because it might impact other users,
>> I would like to ask my sysadmins to set btl_openib_memalign_threshold=12288
>> in etc/openmpi-mca-params.conf on our clusters.
>>
>> Do you see any good reason not to do it?
>>
>> Thanks!
>>
>> Xavier
>>
>> On Mon, May 25, 2015 at 4:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I found the problem - someone had a typo in btl_openib_mca.c. The threshold
>> needs to be set to the module eager limit as that is the only thing defined
>> at that point.
>>
>> Thanks for bringing it to our attention! I’ll set it up to go into 1.8.6
>>
>>> On May 25, 2015, at 3:04 AM, Xavier Besseron <xavier.besse...@uni.lu> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for your reply Ralph.
>>>
>>> The only option I'm using when configuring OpenMPI is '--prefix'.
>>> When checking the config.log file, I see
>>>
>>> configure:208504: checking whether the openib BTL will use malloc hooks
>>> configure:208510: result: yes
>>>
>>> so I guess it is properly enabled (full config.log in attachment of this
>>> email).
>>>
>>> However, I think I have found the reason for the bug (lines refer to the
>>> source code of OpenMPI 1.8.5):
>>>
>>> The default value of memalign_threshold is taken from eager_limit in
>>> function btl_openib_register_mca_params() in btl_openib_mca.c line 717.
>>> But the default value of eager_limit is set in btl_openib_component.c at
>>> line 193 right after the call to btl_openib_register_mca_params().
>>>
>>> To summarize, memalign_threshold gets its value from eager_limit before
>>> this one gets its value assigned.
>>>
>>> Best regards,
>>>
>>> Xavier
>>>
>>> On Mon, May 25, 2015 at 2:27 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Looking at the code, we do in fact set the memalign_threshold = eager_limit
>>> by default, but only if you configured with
>>> --enable-btl-openib-malloc-alignment AND/OR we found the malloc hook
>>> functions were available.
>>>
>>> You might check config.log to see if the openib malloc hooks were enabled.
>>> My guess is that they weren’t, for some reason.
>>>
>>>> On May 24, 2015, at 9:07 AM, Xavier Besseron <xavier.besse...@uni.lu>
>>>> wrote:
>>>>
>>>> Dear OpenMPI developers / users,
>>>>
>>>> This is much more a comment than a question since I believe I have already
>>>> solved my issue. But I would like to report it.
>>>>
>>>> I have noticed my code performed very badly with OpenMPI when InfiniBand is
>>>> enabled, sometimes with +50% or even +100% overhead.
>>>> I also have this slowdown when running with one thread and one process. In
>>>> such a case, there is no other MPI call than MPI_Init() and MPI_Finalize().
>>>> This overhead disappears if I disable the openib btl at runtime, i.e. with
>>>> '--mca btl ^openib'.
>>>> After further investigation, I figured out it comes from the memory
>>>> allocator, which is aligning every memory allocation when InfiniBand is
>>>> used.
>>>> This makes sense because my code is a large irregular C++ code creating
>>>> and deleting many objects.
>>>>
>>>> Just below is the documentation of the relevant MCA parameters coming
>>>> from ompi_info:
>>>>
>>>> MCA btl: parameter "btl_openib_memalign" (current value: "32", data
>>>> source: default, level: 9 dev/all, type: int)
>>>> [64 | 32 | 0] - Enable (64bit or 32bit)/Disable(0)
>>>> memoryalignment for all malloc calls if btl openib is used.
>>>>
>>>> MCA btl: parameter "btl_openib_memalign_threshold" (current value: "0",
>>>> data source: default, level: 9 dev/all, type: size_t)
>>>> Allocating memory more than btl_openib_memalign_threshholdbytes
>>>> will automatically be algined to the value of btl_openib_memalign
>>>> bytes.memalign_threshhold defaults to the same value as
>>>> mca_btl_openib_eager_limit.
>>>>
>>>> MCA btl: parameter "btl_openib_eager_limit" (current value: "12288", data
>>>> source: default, level: 4 tuner/basic, type: size_t)
>>>> Maximum size (in bytes, including header) of "short" messages
>>>> (must be >= 1).
>>>>
>>>> In the end, the problem is that the default value for
>>>> btl_openib_memalign_threshold is 0, which means that all memory
>>>> allocations are aligned to 32 bits.
>>>> The documentation says that the default value of
>>>> btl_openib_memalign_threshold should be the same as
>>>> btl_openib_eager_limit, i.e. 12288 instead of 0.
>>>>
>>>> On my side, changing btl_openib_memalign_threshold to 12288 fixes my
>>>> performance issue.
>>>> However, I believe that the default value of btl_openib_memalign_threshold
>>>> should be fixed in the OpenMPI code (or at least the documentation should
>>>> be fixed).
>>>>
>>>> I tried OpenMPI 1.8.5, 1.7.3 and 1.6.4 and it's all the same.
>>>>
>>>> Bonus question:
>>>> As this issue might impact other users, I'm considering applying a global
>>>> fix on our clusters by setting this default value in
>>>> etc/openmpi-mca-params.conf.
>>>> Do you see any good reason not to do it?
>>>>
>>>> Thank you for your comments.
>>>>
>>>> Best regards,
>>>>
>>>> Xavier
>>>>
>>>> --
>>>> Dr Xavier BESSERON
>>>> Research associate
>>>> FSTC, University of Luxembourg
>>>> Campus Kirchberg, Office E-007
>>>> Phone: +352 46 66 44 5418
>>>> http://luxdem.uni.lu/
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/05/26913.php
>>>
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/05/26915.php
>>
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/05/26918.php
>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26920.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
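
For anyone reading this thread later, the ordering problem Xavier describes (memalign_threshold copied from eager_limit before eager_limit has been assigned) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct and function names below are hypothetical stand-ins, not the actual Open MPI code, and it is not the fix that went into the tree.

```c
#include <stddef.h>

/* Hypothetical stand-in for the settings discussed in the thread;
 * NOT the real Open MPI module structure. */
struct module_params {
    size_t eager_limit;
    size_t memalign_threshold;
};

/* Hypothetical registration step: the threshold default is copied from
 * whatever eager_limit holds at the time of the call. */
static void register_params(struct module_params *p)
{
    p->memalign_threshold = p->eager_limit;
}

/* Order described in the report: registration runs first, and
 * eager_limit is only assigned afterwards. */
static size_t threshold_register_first(void)
{
    struct module_params p = {0, 0};
    register_params(&p);      /* copies the still-zero eager_limit */
    p.eager_limit = 12288;    /* too late to affect the threshold */
    return p.memalign_threshold;
}

/* Opposite order: eager_limit is assigned before registration. */
static size_t threshold_assign_first(void)
{
    struct module_params p = {0, 0};
    p.eager_limit = 12288;
    register_params(&p);      /* now copies 12288 */
    return p.memalign_threshold;
}
```

In this sketch the first function returns 0 and the second returns 12288, which matches the "current value: 0" that ompi_info reported versus the documented default.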