Hi,

Perhaps related: we're seeing this one with 3.1.1. I'll see if I can get the application to run against our --enable-debug build.
Cheers,
Ben

[raijin7:1943 :0:1943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x45)

/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c: [ append_frag_to_ordered_list() ]
      ...
      118      * account for this rollover or the matching will fail.
      119      * Extract the items from the list to order them safely */
      120     if( hdr->hdr_seq < prior->hdr.hdr_match.hdr_seq ) {
 ==>  121         uint16_t d1, d2 = prior->hdr.hdr_match.hdr_seq - hdr->hdr_seq;
      122         do {
      123             d1 = d2;
      124             prior = (mca_pml_ob1_recv_frag_t*)(prior->super.super.opal_list_prev);

==== backtrace ====
 0 0x0000000000012d5f append_frag_to_ordered_list()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:121
 1 0x0000000000013a06 mca_pml_ob1_recv_frag_callback_match()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:390
 2 0x00000000000044ef mca_btl_vader_check_fboxes()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
 3 0x000000000000602f mca_btl_vader_component_progress()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_component.c:689
 4 0x000000000002b554 opal_progress()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
 5 0x00000000000331cc ompi_sync_wait_mt()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
 6 0x000000000004a989 ompi_request_wait_completion()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/../../../../ompi/request/request.h:403
 7 0x000000000004aa1d ompi_request_default_wait()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:42
 8 0x00000000000d3486 ompi_coll_base_sendrecv_actual()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.c:59
 9 0x00000000000d0d2b ompi_coll_base_sendrecv()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.h:67
10 0x00000000000d14c7 ompi_coll_base_allgather_intra_recursivedoubling()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_allgather.c:329
11 0x00000000000056dc ompi_coll_tuned_allgather_intra_dec_fixed()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:551
12 0x000000000006185d PMPI_Allgather()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mpi/c/profile/pallgather.c:122
13 0x000000000004362c ompi_allgather_f()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/intel/debug-0/ompi/mpi/fortran/mpif-h/profile/pallgather_f.c:86
14 0x00000000005ed3cb comms_allgather_integer_0()  /short/z00/aab900/onetep/src/comms_mod.F90:14795
15 0x0000000001309fe1 multigrid_bc_for_dlmg()  /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:270
16 0x0000000001309fe1 multigrid_initialise()  /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:174
17 0x0000000000f0c885 hartree_via_multigrid()  /short/z00/aab900/onetep/src/hartree_mod.F90:181
18 0x0000000000a0c62a electronic_init_pot()  /short/z00/aab900/onetep/src/electronic_init_mod.F90:1123
19 0x0000000000a14d62 electronic_init_denskern()  /short/z00/aab900/onetep/src/electronic_init_mod.F90:334
20 0x0000000000a50136 energy_and_force_calculate()  /short/z00/aab900/onetep/src/energy_and_force_mod.F90:1702
21 0x00000000014f46e7 onetep()  /short/z00/aab900/onetep/src/onetep.F90:277
22 0x000000000041465e main()  ???:0
23 0x000000000001ed1d __libc_start_main()  ???:0
24 0x0000000000414569 _start()  ???:0
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
onetep.nci         0000000001DCC6DE  Unknown               Unknown  Unknown
libpthread-2.12.s  00002B6D46ED07E0  Unknown               Unknown  Unknown
libmlx4-rdmav2.so  00002B6D570E3B18  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node raijin7 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

> On 12 Jul 2018, at 8:16 am, Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
> 
> It might also be worth testing a master snapshot to see if that fixes the issue. There are a couple of fixes being backported from master to v3.0.x and v3.1.x now.
> 
> -Nathan
> 
> On Jul 11, 2018, at 03:16 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
> 
>>> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:
>>> 
>>> Ok, that would be great -- thanks.
>>> 
>>> Recompiling Open MPI with --enable-debug will turn on several debugging/sanity checks inside Open MPI, and it will also enable debugging symbols. Hence, if you can get a failure with a debug Open MPI build, it might give you a core file that can be used to get a more detailed stack trace, poke around and see if there's a NULL pointer somewhere, etc.
>> 
>> I haven't tried to get a core file yet, but it's not producing any more info from the runtime stack trace, despite configuring with --enable-debug:
>> 
>> Image              PC                Routine            Line        Source
>> vasp.gamma_para.i  0000000002DCE8C1  Unknown               Unknown  Unknown
>> vasp.gamma_para.i  0000000002DCC9FB  Unknown               Unknown  Unknown
>> vasp.gamma_para.i  0000000002D409E4  Unknown               Unknown  Unknown
>> vasp.gamma_para.i  0000000002D407F6  Unknown               Unknown  Unknown
>> vasp.gamma_para.i  0000000002CDCED9  Unknown               Unknown  Unknown
>> vasp.gamma_para.i  0000000002CE3DB6  Unknown               Unknown  Unknown
>> libpthread-2.12.s  0000003F8E60F7E0  Unknown               Unknown  Unknown
>> mca_btl_vader.so   00002B1AFA5FAC30  Unknown               Unknown  Unknown
>> mca_btl_vader.so   00002B1AFA5FD00D  Unknown               Unknown  Unknown
>> libopen-pal.so.40  00002B1AE884327C  opal_progress         Unknown  Unknown
>> mca_pml_ob1.so     00002B1AFB855DCE  Unknown               Unknown  Unknown
>> mca_pml_ob1.so     00002B1AFB858305  mca_pml_ob1_send      Unknown  Unknown
>> libmpi.so.40.10.1  00002B1AE823A5DA  ompi_coll_base_al     Unknown  Unknown
>> mca_coll_tuned.so  00002B1AFC6F0842  ompi_coll_tuned_a     Unknown  Unknown
>> libmpi.so.40.10.1  00002B1AE81B66F5  PMPI_Allreduce        Unknown  Unknown
>> libmpi_mpifh.so.4  00002B1AE7F2259B  mpi_allreduce_        Unknown  Unknown
>> vasp.gamma_para.i  000000000042D1ED  m_sum_d_                 1300  mpi.F
>> vasp.gamma_para.i  000000000089947D  nonl_mp_vnlacc_.R        1754  nonl.F
>> vasp.gamma_para.i  0000000000972C51  hamil_mp_hamiltmu         825  hamil.F
>> vasp.gamma_para.i  0000000001BD2608  david_mp_eddav_.R         419  davidson.F
>> vasp.gamma_para.i  0000000001D2179E  elmin_.R                  424  electron.F
>> vasp.gamma_para.i  0000000002B92452  vamp_IP_electroni        4783  main.F
>> vasp.gamma_para.i  0000000002B6E173  MAIN__                   2800  main.F
>> vasp.gamma_para.i  000000000041325E  Unknown               Unknown  Unknown
>> libc-2.12.so       0000003F8E21ED1D  __libc_start_main     Unknown  Unknown
>> vasp.gamma_para.i  0000000000413169  Unknown               Unknown  Unknown
>> 
>> This is the configure line that was supposedly used to create the library:
>> 
>> ./configure --prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080 --with-tm=/usr/local/torque --enable-mpirun-prefix-by-default --with-verbs=/usr --with-verbs-libdir=/usr/lib64 --enable-debug
>> 
>> Is there any way I can confirm that the version of the openmpi library I think I'm using really was compiled with debugging?
>> 
>> Noam
>> 
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
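
For context on the rollover handling visible in the pml_ob1_recvfrag.c excerpt in the backtrace above: the match-header sequence numbers there are 16-bit and wrap from 65535 back to 0, so a plain "<" comparison cannot order fragments near the wrap point. The following is a minimal standalone sketch of wraparound-aware comparison, not Open MPI's actual implementation; the helper name seq_before and the half-range (32767) threshold are assumptions made only for this illustration.

/* Minimal standalone sketch (not Open MPI code): ordering 16-bit message
   sequence numbers that wrap around from 65535 back to 0.  A plain "a < b"
   test mis-orders fragments near the rollover; comparing the unsigned
   modular distance (b - a) against half the sequence space handles the wrap,
   assuming two live fragments are never more than 32767 apart. */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper for this example only: nonzero if sequence number
   a logically precedes b. */
static int seq_before(uint16_t a, uint16_t b)
{
    uint16_t dist = (uint16_t)(b - a);   /* distance from a to b, mod 65536 */
    return dist != 0 && dist < 0x8000;
}

int main(void)
{
    printf("%d\n", seq_before(100, 200));   /* 1: no wrap, same answer as naive "<"  */
    printf("%d\n", seq_before(65530, 3));   /* 1: 65530 precedes 3 across the wrap   */
    printf("%d\n", 65530 < 3);              /* 0: the naive comparison gets it wrong */
    return 0;
}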