Please give master a try. This looks like another signature of running out of space for shared memory buffers.
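If rebuilding against master isn't convenient right away, one quick way to test whether the shared-memory path is really the culprit (a diagnostic, not a fix) is to exclude the vader BTL and see whether the crash in mca_btl_vader_check_fboxes disappears:

    mpirun --mca btl ^vader <your usual arguments>

On-node traffic will fall back to a slower transport, so this is only useful for confirming where the problem lives.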
-Nathan

> On Jul 13, 2018, at 6:41 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> Just to summarize for the list. With Jeff’s prodding I got it generating core files with the debug (and mem-debug) version of openmpi, and below is the kind of stack trace I’m getting from gdb. It looks slightly different when I use a slightly different implementation that doesn’t use MPI_IN_PLACE, but nearly the same. The array that’s being summed is not large, 3776 doubles.
>
> #0  0x0000003160a32495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x0000003160a33bfd in abort () at abort.c:121
> #2  0x0000000002a3903e in for__issue_diagnostic ()
> #3  0x0000000002a3ff66 in for__signal_handler ()
> #4  <signal handler called>
> #5  0x00002b67a4217029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
> #6  0x00002b67a421962e in mca_btl_vader_component_progress () at btl_vader_component.c:724
> #7  0x00002b67934fd311 in opal_progress () at runtime/opal_progress.c:229
> #8  0x00002b6792e2f0df in ompi_request_wait_completion (req=0xe863600) at ../ompi/request/request.h:415
> #9  0x00002b6792e2f122 in ompi_request_default_wait (req_ptr=0x7ffebdbb8c20, status=0x0) at request/req_wait.c:42
> #10 0x00002b6792ed7d5a in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0) at base/coll_base_allreduce.c:460
> #11 0x00002b67a6ccb3e2 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0) at coll_tuned_decision_fixed.c:74
> #12 0x00002b6792e4d9b0 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0xeb79ca0, count=3776, datatype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0) at pallreduce.c:113
> #13 0x00002b6792bb6287 in ompi_allreduce_f (sendbuf=0x1 <Address 0x1 out of bounds>, recvbuf=0xeb79ca0 "\310,&AYI\257\276\031\372\214\223\270-y>\207\066\226\003W\f\240\276\334'}\225\376\336\277>\227§\231", count=0x7ffebdbbc4d4, datatype=0x2b48f5c, op=0x2b48f60, comm=0x5a0ae60, ierr=0x7ffebdbb8f60) at pallreduce_f.c:87
> #14 0x000000000042991b in m_sumb_d (comm=..., vec=..., n=Cannot access memory at address 0x928) at mpi.F:870
> #15 m_sum_d (comm=..., vec=..., n=Cannot access memory at address 0x928) at mpi.F:3184
> #16 0x0000000001b22b83 in david::eddav (hamiltonian=..., p=Cannot access memory at address 0x1) at davidson.F:779
> #17 0x0000000001c6ef0e in elmin (hamiltonian=..., kineden=Cannot access memory at address 0x19) at electron.F:424
> #18 0x0000000002a108b2 in electronic_optimization () at main.F:4783
> #19 0x00000000029ec5d3 in vamp () at main.F:2800
> #20 0x00000000004100de in main ()
> #21 0x0000003160a1ed1d in __libc_start_main (main=0x4100b0 <main>, argc=1, ubp_av=0x7ffebdbc5e38, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7ffebdbc5e28) at libc-start.c:226
> #22 0x000000000040ffe9 in _start ()
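For anyone who wants to poke at this outside of VASP: frames #12/#13 show sendbuf=0x1, which is Open MPI's MPI_IN_PLACE sentinel, so the failing call boils down to an in-place MPI_SUM allreduce of 3776 doubles. A standalone loop over that pattern, run with several ranks, might make the problem easier to bisect. This is only a sketch of the call pattern as read from the trace, not VASP's actual code:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 3776;            /* size seen in frame #12 */
        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; ++i)
            buf[i] = (double)i;

        /* Repeat the in-place allreduce; if the issue really is
           shared-memory buffer exhaustion, it should take many
           iterations to trigger rather than failing on the first call. */
        for (int iter = 0; iter < 1000000; ++iter)
            MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Build with mpicc and run with all ranks on a single node so the vader fast-box path is actually exercised.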