Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-debug) build of Open MPI, and below is the kind of stack trace I’m getting from gdb. The trace looks slightly different with an alternate implementation that doesn’t use MPI_IN_PLACE, but it’s nearly the same. The array being summed is not large, 3776 doubles.
#0  0x0000003160a32495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003160a33bfd in abort () at abort.c:121
#2  0x0000000002a3903e in for__issue_diagnostic ()
#3  0x0000000002a3ff66 in for__signal_handler ()
#4  <signal handler called>
#5  0x00002b67a4217029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
#6  0x00002b67a421962e in mca_btl_vader_component_progress () at btl_vader_component.c:724
#7  0x00002b67934fd311 in opal_progress () at runtime/opal_progress.c:229
#8  0x00002b6792e2f0df in ompi_request_wait_completion (req=0xe863600) at ../ompi/request/request.h:415
#9  0x00002b6792e2f122 in ompi_request_default_wait (req_ptr=0x7ffebdbb8c20, status=0x0) at request/req_wait.c:42
#10 0x00002b6792ed7d5a in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0) at base/coll_base_allreduce.c:460
#11 0x00002b67a6ccb3e2 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0) at coll_tuned_decision_fixed.c:74
#12 0x00002b6792e4d9b0 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0xeb79ca0, count=3776, datatype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0) at pallreduce.c:113
#13 0x00002b6792bb6287 in ompi_allreduce_f (sendbuf=0x1 <Address 0x1 out of bounds>, recvbuf=0xeb79ca0 "\310,&AYI\257\276\031\372\214\223\270-y>\207\066\226\003W\f\240\276\334'}\225\376\336\277>\227§\231", count=0x7ffebdbbc4d4, datatype=0x2b48f5c, op=0x2b48f60, comm=0x5a0ae60, ierr=0x7ffebdbb8f60) at pallreduce_f.c:87
#14 0x000000000042991b in m_sumb_d (comm=..., vec=..., n=Cannot access memory at address 0x928) at mpi.F:870
#15 m_sum_d (comm=..., vec=..., n=Cannot access memory at address 0x928) at mpi.F:3184
#16 0x0000000001b22b83 in david::eddav (hamiltonian=..., p=Cannot access memory at address 0x1) at davidson.F:779
#17 0x0000000001c6ef0e in elmin (hamiltonian=..., kineden=Cannot access memory at address 0x19) at electron.F:424
#18 0x0000000002a108b2 in electronic_optimization () at main.F:4783
#19 0x00000000029ec5d3 in vamp () at main.F:2800
#20 0x00000000004100de in main ()
#21 0x0000003160a1ed1d in __libc_start_main (main=0x4100b0 <main>, argc=1, ubp_av=0x7ffebdbc5e38, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7ffebdbc5e28) at libc-start.c:226
#22 0x000000000040ffe9 in _start ()
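For context, the call that ends up in PMPI_Allreduce above presumably looks something like the sketch below. This is not the actual m_sum_d from our mpi.F (the subroutine and argument names are made up for illustration), just the general shape of an in-place allreduce over a small double-precision array:

      ! Illustrative sketch only, assuming the Fortran mpi module is available;
      ! not the real m_sum_d. Sums vec(1:n) across all ranks in comm, in place.
      subroutine sum_in_place_d(vec, n, comm)
        use mpi
        implicit none
        integer, intent(in)    :: n, comm
        real(8), intent(inout) :: vec(n)
        integer :: ierr
        ! MPI_IN_PLACE as the send buffer: each rank contributes its own vec
        ! and the summed result is written back into the same array.
        call MPI_ALLREDUCE(MPI_IN_PLACE, vec, n, MPI_DOUBLE_PRECISION, &
                           MPI_SUM, comm, ierr)
      end subroutine sum_in_place_d

If I understand Open MPI’s headers correctly, MPI_IN_PLACE on the C side is the address 0x1, which matches the sendbuf=0x1 (and the harmless "Address 0x1 out of bounds" note gdb attaches to it) in frames #10 through #13, so the in-place variant does appear to be the one crashing here.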