All,

The subject line of this email is vague, but that's only because I'm not sure what is happening.
To wit: I help maintain a code on a cluster where, with the GNU compilers, we use Open MPI. We currently use Open MPI 4.1 because Open MPI 5 just has not worked for us. With Open MPI 4.1 I've run our code on 320 nodes (38400 processes) and it's just fine. But with Open MPI 5, if I try to run on even 3 nodes (96 processes), it crashes out.

Now, from our tracebacks it seems to die in ESMF, so it's not an "easy" issue to pin down. I've run all the OSU collective microbenchmarks on 4 nodes and they pass, so it doesn't look like some simple MPI_AllFoo call that's unhappy. After throwing in all the debugging flags I could, I eventually got a traceback that went deep down and showed:

  #9  0x14d5ffe01aab in _Znwm
        at /usr/local/other/SRC/gcc/gcc-14.2.0/libstdc++-v3/libsupc++/new_op.cc:50
  #10 0x14d61793428f in ???
  #11 0x14d61793196e in _ZNSt16allocator_traitsISaIN5ESMCI8NODE_PNTEEE8allocateERS2_m
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/alloc_traits.h:478
  #12 0x14d61793196e in _ZNSt12_Vector_baseIN5ESMCI8NODE_PNTESaIS1_EE11_M_allocateEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/stl_vector.h:380
  #13 0x14d6179302dd in _ZNSt6vectorIN5ESMCI8NODE_PNTESaIS1_EE7reserveEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/vector.tcc:79
  #14 0x14d61792dd85 in _search_exact_point_match
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1031
  #15 0x14d61792f1ef in _ZN5ESMCI9OctSearchERKNS_4MeshERNS_9PointListENS_8MAP_TYPEEjiRSt6vectorIPNS_13Search_resultESaIS8_EEbRNS_4WMatEd
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1362

Thinking "that looks malloc-ish", I remembered our code has "experimental" support for JeMalloc (experimental in that we don't often run with it). As a final Hail Mary, I decided to build the code linking to JeMalloc and... it runs on 4 nodes!

So, I was wondering if this sort of oddity reminds anyone of anything? I'll note that I am building Open MPI 5 pretty boringly:

  Configure command line: '--disable-wrapper-rpath' '--disable-wrapper-runpath'
                          '--with-slurm' '--with-hwloc=internal'
                          '--with-libevent=internal' '--with-pmix=internal'
                          '--disable-libxml2'

I've also tried the system UCX (1.14) as well as a hand-built UCX (1.18), and the behavior seems to occur with both. (Though at this point I've tried so many things I might not have covered every combination.)

Thanks for any thoughts,
Matt

Matt Thompson
Lead Scientific Software Engineer/Supervisor
Global Modeling and Assimilation Office
Science Systems and Applications, Inc. (http://www.ssaihq.com/)
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
o: 301-614-6712
matthew.thomp...@nasa.gov
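P.S. For anyone who wants to try the same JeMalloc experiment on their own code, the sketch below is roughly what I mean. It is only a sketch: the jemalloc path, executable name, and launch line are placeholders for whatever your install and workflow look like, and the compiler wrapper shown is just an example. We link it in at build time, but interposing at run time with LD_PRELOAD is the quicker test.

  # Run-time interposition: no relink needed.
  # (The libjemalloc path is illustrative; point it at your install.)
  export LD_PRELOAD=/path/to/jemalloc/lib/libjemalloc.so.2
  srun -N 4 -n 96 ./our_app        # or however you normally launch

  # Build-time linking: add jemalloc to the final link of the executable.
  mpifort -o our_app our_app.o \
      -L/path/to/jemalloc/lib -ljemalloc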