All,

The subject line of this email is vague, but that's only because I'm not sure 
what is happening.

To wit, I help maintain a code on a cluster where, with the GNU compilers, we 
use Open MPI. We currently use Open MPI 4.1 because Open MPI 5 just has not 
worked there.

With Open MPI 4.1, I've run our code on 320 nodes (38400 processes) and it's 
just fine. But with Open MPI 5, if I try to run on even 3 nodes (96 processes), 
it crashes.

Now, from our tracebacks it seems to die in ESMF, so it's not an "easy" issue 
to pin down. I've run all of the OSU collective microbenchmarks on 4 nodes and 
they all pass, so it's not some simple MPI_AllFoo call that's unhappy.
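
(For reference, the benchmark runs were along these lines; the binary paths 
and process counts below are illustrative, not my exact invocations, but it 
was the same Open MPI 5 build underneath:

  mpirun -np 128 --map-by node ./osu_allreduce
  mpirun -np 128 --map-by node ./osu_alltoall
)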

After throwing in all the debugging flags I could, I eventually got a 
traceback that went deep enough to show:

#9  0x14d5ffe01aab in _Znwm
        at /usr/local/other/SRC/gcc/gcc-14.2.0/libstdc++-v3/libsupc++/new_op.cc:50
#10  0x14d61793428f in ???
#11  0x14d61793196e in _ZNSt16allocator_traitsISaIN5ESMCI8NODE_PNTEEE8allocateERS2_m
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/alloc_traits.h:478
#12  0x14d61793196e in _ZNSt12_Vector_baseIN5ESMCI8NODE_PNTESaIS1_EE11_M_allocateEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/stl_vector.h:380
#13  0x14d6179302dd in _ZNSt6vectorIN5ESMCI8NODE_PNTESaIS1_EE7reserveEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/vector.tcc:79
#14  0x14d61792dd85 in _search_exact_point_match
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1031
#15  0x14d61792f1ef in _ZN5ESMCI9OctSearchERKNS_4MeshERNS_9PointListENS_8MAP_TYPEEjiRSt6vectorIPNS_13Search_resultESaIS8_EEbRNS_4WMatEd
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1362
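
(For anyone who doesn't read mangled names fluently, c++filt decodes the 
interesting frames, e.g.:

  $ echo _Znwm | c++filt
  operator new(unsigned long)
  $ echo _ZNSt6vectorIN5ESMCI8NODE_PNTESaIS1_EE7reserveEm | c++filt
  std::vector<ESMCI::NODE_PNT, std::allocator<ESMCI::NODE_PNT> >::reserve(unsigned long)

so the crash is inside operator new while an ESMF search routine is growing a 
std::vector.)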


Thinking "that looks malloc-ish", I remembered our code has an "experimental" 
support for JeMalloc (experimental in that we don't often run with it).

As a final Hail Mary, I decided to build the code linking to JeMalloc and...it 
runs on 4 nodes!
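
(In case it matters, the JeMalloc linking is nothing exotic. Roughly, with 
the library path and executable name below being placeholders, it's either

  gfortran ... -L/path/to/jemalloc/lib -ljemalloc -o ./our_model.x

at link time, or preloading it into an existing executable:

  export LD_PRELOAD=/path/to/jemalloc/lib/libjemalloc.so

Either way, malloc/free/new go through JeMalloc instead of the glibc 
allocator.)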

So, I was wondering if this sort of oddity reminds anyone of anything? I'll 
note that I am building Open MPI 5 pretty boringly:

  Configure command line: '--disable-wrapper-rpath'
                          '--disable-wrapper-runpath' '--with-slurm'
                          '--with-hwloc=internal' '--with-libevent=internal'
                          '--with-pmix=internal' '--disable-libxml2'
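
(If anyone thinks a debug build of Open MPI would show more, I can 
reconfigure with the same options plus the debug switches, something like:

  ./configure --disable-wrapper-rpath --disable-wrapper-runpath \
              --with-slurm --with-hwloc=internal --with-libevent=internal \
              --with-pmix=internal --disable-libxml2 \
              --enable-debug --enable-mem-debug

and rerun.)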

I've also tried the system UCX (1.14) as well as a hand-built UCX (1.18), and 
the behavior seems to occur with both. (Though at this point I've tried so 
many things that I might not have covered all combinations.)
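
(One isolation test I can also run, if it's useful: force the PML one way or 
the other to take UCX in or out of the picture, e.g.

  mpirun --mca pml ucx -np 96 ./our_model.x
  mpirun --mca pml ob1 --mca btl tcp,self -np 96 ./our_model.x

where ./our_model.x is again a stand-in for our executable.)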

Thanks for any thoughts,
Matt




Matt Thompson
Lead Scientific Software Engineer/Supervisor
Global Modeling and Assimilation Office
Science Systems and Applications, Inc.
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
o: 301-614-6712
matthew.thomp...@nasa.gov
