Yeah, it's a bit terrible, but we didn't reliably reproduce this problem for 
many months, either.  :-\

As George noted, the fix has been ported to all the release branches, but it is 
not yet in an official release.  Until those releases are out (4.0.0 just had an 
RC and will be released soon; 3.0.3 will have an RC in the immediate future), 
your best bet is to grab a nightly tarball from any of the v2.1.x, v3.0.x, 
v3.1.x, or v4.0.x branches.

My $0.02: if you're upgrading from Open MPI v1.7 anyway, you might as well jump 
up to v4.0.x (i.e., don't bother jumping to an older release).



> On Sep 19, 2018, at 9:53 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> I can't speculate on why you did not notice the memory issue before, simply 
> because for months we (the developers) didn't notice it either, and our testing 
> infrastructure didn't catch this bug despite running millions of tests.  The 
> root cause of the bug was a memory-ordering issue, and these are really 
> tricky to identify.
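> 
> To make "memory ordering" concrete: the generic shape of this class of bug is 
> a writer that stores a payload and then sets a "ready" flag, while a reader 
> polls the flag and consumes the payload.  The sketch below is purely 
> illustrative (plain C11 atomics and pthreads; it has nothing to do with the 
> actual OMPI code or fix).  Drop the release/acquire ordering and the reader 
> can occasionally observe the flag before the payload, which is exactly the 
> kind of failure that survives millions of test runs.
> 
> /* illustrative sketch only; all names here are made up */
> #include <pthread.h>
> #include <stdatomic.h>
> #include <stdio.h>
> 
> static int payload;          /* data being handed over             */
> static atomic_int ready;     /* "the payload is there" flag (zero-initialized) */
> 
> static void *writer(void *arg)
> {
>     payload = 42;                                             /* 1: write data */
>     atomic_store_explicit(&ready, 1, memory_order_release);   /* 2: publish    */
>     return NULL;
> }
> 
> static void *reader(void *arg)
> {
>     /* spin until the flag becomes visible */
>     while (!atomic_load_explicit(&ready, memory_order_acquire))
>         ;
>     printf("payload = %d\n", payload);   /* with acquire/release: always 42 */
>     return NULL;
> }
> 
> int main(void)
> {
>     pthread_t w, r;
>     pthread_create(&r, NULL, reader, NULL);
>     pthread_create(&w, NULL, writer, NULL);
>     pthread_join(w, NULL);
>     pthread_join(r, NULL);
>     return 0;
> }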
> 
> According to https://github.com/open-mpi/ompi/issues/5638 the patch was 
> backported to all stable releases starting from 2.1.  Until their official 
> release, however, you would need to either get a nightly snapshot or try your 
> luck with master.
> 
>   George.
> 
> 
> On Wed, Sep 19, 2018 at 3:41 AM Patrick Begou 
> <patrick.be...@legi.grenoble-inp.fr> wrote:
> Hi George
> 
> thanks for your answer.  I was previously using OpenMPI 3.1.2 and also had 
> this problem.  However, using --enable-debug --enable-mem-debug at 
> configuration time, I was unable to reproduce the failure, and it was quite 
> difficult for me to trace the problem.  Maybe I have not run enough tests to 
> reach the failure point.
> 
> I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x series.
> The problem was still there, but with the debug config I was able to trace the 
> call stack.
> 
> Which OpenMPI 3.x version do you suggest?  A nightly snapshot?  Cloning the 
> git repo?
> 
> Thanks
> 
> Patrick
> 
> George Bosilca wrote:
>> A few days ago we pushed a fix to master for a strikingly similar issue. 
>> The patch will eventually make it into the 4.0 and 3.1 series, but not the 
>> 2.x series.  The best path forward will be to migrate to a more recent OMPI 
>> version.
>> 
>> George.
>> 
>> 
>> On Tue, Sep 18, 2018 at 3:50 AM Patrick Begou 
>> <patrick.be...@legi.grenoble-inp.fr> wrote:
>> Hi
>> 
>> I'm moving a large CFD code from GCC 4.8.5/OpenMPI 1.7.3 to GCC 
>> 7.3.0/OpenMPI 2.1.5, and with this latest config I have random segfaults: 
>> same binary, same server, same number of processes (16), same parameters for 
>> the run.  Sometimes it runs until the end, sometimes I get an 'invalid memory 
>> reference'.
>> 
>> Building the application and OpenMPI in debug mode, I saw that this random 
>> segfault always occurs in collective communications inside OpenMPI.  I have 
>> no idea how to track this down.  Here are two call stack traces (just the 
>> OpenMPI part):
>> 
>> Calling  MPI_ALLREDUCE(...)
>> 
>> Program received signal SIGSEGV: Segmentation fault - invalid memory 
>> reference.
>> 
>> Backtrace for this error:
>> #0  0x7f01937022ef in ???
>> #1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7f0192dd0331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7f0192d6b92b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7f0194a8a9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7f0194a8a9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
>>     at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
>> #7  0x7f0194aa0a0a in PMPI_Allreduce
>>     at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
>> #8  0x7f0194f2e2ba in ompi_allreduce_f
>>     at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
>> #9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
>>     at linear_solver_deflation_m.f90:341
>> 
>> 
>> Calling MPI_WAITALL()
>> 
>> Program received signal SIGSEGV: Segmentation fault - invalid memory 
>> reference.
>> 
>> Backtrace for this error:
>> #0  0x7fda5a8d72ef in ???
>> #1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
>>     at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
>> #2  0x7fda59fa5331 in mca_btl_vader_component_progress
>>     at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
>> #3  0x7fda59f4092b in opal_progress
>>     at ../../opal/runtime/opal_progress.c:226
>> #4  0x7fda5bc5f9a4 in sync_wait_st
>>     at ../../opal/threads/wait_sync.h:80
>> #5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
>>     at ../../ompi/request/req_wait.c:221
>> #6  0x7fda5bca329e in PMPI_Waitall
>>     at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
>> #7  0x7fda5c10bc00 in ompi_waitall_f
>>     at 
>> /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
>> #8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
>>     at data_comm_m.f90:5849
>> 
>> 
>> The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at:
>> 
>> 207                /* call the registered callback function */
>> 208                reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
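>> 
>> For what it is worth, the application side of those two traces boils down to 
>> an allreduce (the global dot product in the PCG solver) and a nonblocking 
>> ghost-cell exchange finished with a waitall.  Roughly, in C (illustrative 
>> sketch only; the real code is Fortran and much larger, and the ring-neighbour 
>> choice below is made up just so the example runs on its own):
>> 
>> #include <mpi.h>
>> #include <stdio.h>
>> 
>> #define N 1024
>> 
>> int main(int argc, char **argv)
>> {
>>     int rank, size;
>>     double sendbuf[N], recvbuf[N], local_dot = 0.0, global_dot = 0.0;
>>     MPI_Request reqs[2];
>> 
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>> 
>>     for (int i = 0; i < N; i++) {
>>         sendbuf[i] = rank + i;
>>         local_dot += sendbuf[i] * sendbuf[i];
>>     }
>> 
>>     /* 1) global reduction, as in the MPI_ALLREDUCE trace */
>>     MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
>>                   MPI_COMM_WORLD);
>> 
>>     /* 2) nonblocking exchange completed by MPI_WAITALL, as in the
>>      *    second trace */
>>     int right = (rank + 1) % size;
>>     int left  = (rank - 1 + size) % size;
>>     MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
>>     MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
>>     MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
>> 
>>     if (rank == 0)
>>         printf("global dot product = %g\n", global_dot);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }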
>> 
>> 
>> OpenMPI 2.1.5 is built with:
>> 
>> CFLAGS="-O3 -march=native -mtune=native" \
>> CXXFLAGS="-O3 -march=native -mtune=native" \
>> FCFLAGS="-O3 -march=native -mtune=native" \
>> ../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default --disable-dlopen \
>>     --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx \
>>     --without-slurm --enable-mpi-thread-multiple --enable-debug --enable-mem-debug
>> 
>> Any help appreciated
>> 
>> Patrick
>> -- 
>> ===================================================================
>> |  Equipe M.O.S.T.         |                                      |
>> |  Patrick BEGOU           | mailto:patrick.be...@grenoble-inp.fr |
>> |  LEGI                    |                                      |
>> |  BP 53 X                 | Tel 04 76 82 51 35                   |
>> |  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                   |
>> ===================================================================
>> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
