Yeah, it's a bit terrible, but we didn't reliably reproduce this problem for many months, either. :-\

As George noted, the fix has been ported to all the release branches, but it is not yet in an official release. Until one ships (4.0.0 just had an rc and will be released soon; 3.0.3 will have an RC in the immediate future), your best bet is to grab a nightly tarball from any of the v2.1.x, v3.0.x, v3.1.x, or v4.0.x branches. My $0.02: if you're upgrading from Open MPI v1.7 anyway, you might as well jump straight to v4.0.x (i.e., don't bother stopping at an older release).
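If it helps, here's a quick way to confirm at runtime which Open MPI your binary actually linked against. This is just a minimal sketch; MPI_Get_library_version() is a standard MPI-3 call, so it works with any of the versions above:

    /* Print the full version string of the MPI library this binary is
     * linked against -- useful when several installs coexist. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len;

        MPI_Init(&argc, &argv);
        MPI_Get_library_version(version, &len);
        printf("%s\n", version);
        MPI_Finalize();
        return 0;
    }

Run it as "mpirun -np 1 ./check_version" (the name is arbitrary) and make sure the string matches the nightly you installed.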
> On Sep 19, 2018, at 9:53 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> I can't speculate on why you did not notice the memory issue before, simply
> because for months we (the developers) didn't notice it, and our testing
> infrastructure didn't catch this bug despite running millions of tests. The
> root cause of the bug was a memory ordering issue, and these are really
> tricky to identify.
>
> According to https://github.com/open-mpi/ompi/issues/5638 the patch was
> backported to all stable releases starting from 2.1. Until their official
> release, however, you would either need to get a nightly snapshot or try
> your luck with master.
>
>   George.
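To unpack George's point a bit: the vader fast-box path is essentially a publish/consume handshake in shared memory (the sender writes a payload, then publishes a header that the receiver polls). The failure mode of a missing barrier looks roughly like this toy sketch -- illustrative only, using C11 atomics, and emphatically not the actual Open MPI code:

    /* Assume producer() and consumer() run on two different threads
     * (or, as in the BTL, in two different processes sharing memory). */
    #include <stdatomic.h>

    static int payload;              /* the message body          */
    static atomic_int ready = 0;     /* the published header/flag */

    void producer(void)
    {
        payload = 42;
        /* The release store keeps the payload write visible before the
         * flag.  With a plain (or relaxed) store, the compiler -- and,
         * on weakly ordered CPUs, the hardware -- may reorder the two. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* Acquire pairs with the release above: once we observe
         * ready == 1, the payload write is guaranteed visible. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;
    }

Drop the release/acquire pair and the consumer occasionally reads stale data, but only under the right timing and on the right hardware -- which is exactly why this class of bug can survive millions of tests and then bite at random.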
> On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou
> <patrick.be...@legi.grenoble-inp.fr> wrote:
>
> Hi George
>
> Thanks for your answer. I was previously using OpenMPI 3.1.2 and also had
> this problem. However, with --enable-debug --enable-mem-debug set at
> configure time I was unable to reproduce the failure, so it was quite
> difficult for me to trace the problem. Maybe I did not run enough tests to
> reach the failure point.
>
> I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x series.
> The problem was still there, but with the debug config I was able to trace
> the call stack.
>
> Which OpenMPI 3.x version do you suggest? A nightly snapshot? Cloning the
> git repo?
>
> Thanks
>
> Patrick
>
> George Bosilca wrote:
>> A few days ago we pushed a fix to master for a strikingly similar issue.
>> The patch will eventually make it into the 4.0 and 3.1 releases, but not
>> the 2.x series. The best path forward is to migrate to a more recent OMPI
>> version.
>>
>>   George.
>>
>> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou
>> <patrick.be...@legi.grenoble-inp.fr> wrote:
>> Hi
>>
>> I'm moving a large CFD code from GCC 4.8.5 / OpenMPI 1.7.3 to GCC 7.3.0 /
>> OpenMPI 2.1.5, and with this latest config I get random segfaults: same
>> binary, same server, same number of processes (16), same parameters for
>> the run. Sometimes it runs until the end; sometimes I get "invalid memory
>> reference".
>>
>> Building the application and OpenMPI in debug mode, I saw that this random
>> segfault always occurs in collective communications inside OpenMPI. I have
>> no idea how to track this down. These are the two call stack traces (just
>> the OpenMPI part):
>>
>> Calling MPI_ALLREDUCE(...):
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
>>
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>>     at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>>     at linear_solver_deflation_m.f90:341
>>
>> Calling MPI_WAITALL():
>>
>> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
>>
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>>     at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
>> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
>>     at data_comm_m.f90:5849
>>
>> The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at:
>>
>>   207    /* call the registered callback function */
>>   208    reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
>>
>> OpenMPI 2.1.5 is built with:
>>
>>   CFLAGS="-O3 -march=native -mtune=native" \
>>   CXXFLAGS="-O3 -march=native -mtune=native" \
>>   FCFLAGS="-O3 -march=native -mtune=native" \
>>   ../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default --disable-dlopen \
>>       --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx \
>>       --without-slurm --enable-mpi-thread-multiple --enable-debug --enable-mem-debug
>>
>> Any help appreciated
>>
>> Patrick
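Two small additions from me while you wait on the nightlies. First, as a stopgap you should be able to dodge the fast-box code path entirely by excluding the shared-memory (vader) BTL, at the cost of slower on-node communication:

    mpirun --mca btl ^vader -np 16 ./your_app

Second, if you want to stress a candidate build before trusting it, a small loop that hammers the same two patterns as your traces (a blocking MPI_Allreduce plus a nonblocking exchange completed by MPI_Waitall) can raise the odds of surfacing a rare race. This is a hypothetical stress test I'm sketching here, not your code:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, iter;
        double sbuf[1024], rbuf[1024], sum, val;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;   /* ring neighbors */
        int right = (rank + 1) % size;
        for (i = 0; i < 1024; i++)
            sbuf[i] = rank + i;
        val = rank;

        for (iter = 0; iter < 100000; iter++) {
            MPI_Request req[4];

            /* Pattern 1: the collective from the first trace. */
            MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            /* Pattern 2: a ghost-cell-style exchange like the second trace. */
            MPI_Irecv(rbuf,       512, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(rbuf + 512, 512, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(sbuf,       512, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(sbuf + 512, 512, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        }

        if (rank == 0)
            printf("completed %d iterations, sum = %g\n", iter, sum);
        MPI_Finalize();
        return 0;
    }

Run it with all 16 ranks on one node so everything goes through shared memory. With the unpatched library it may still take many runs to trip the race, so treat a pass as encouraging rather than conclusive.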
-- 
Jeff Squyres
jsquy...@cisco.com