Please give master a try. This looks like another signature of running out of 
space for shared memory buffers.

-Nathan

> On Jul 13, 2018, at 6:41 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> 
> wrote:
> 
> Just to summarize for the list: with Jeff’s prodding I got it generating 
> core files with the debug (and mem-debug) build of Open MPI, and below is 
> the kind of stack trace I’m getting from gdb.  It looks slightly different 
> with a slightly different implementation that doesn’t use MPI_IN_PLACE, 
> but is nearly the same.  The array being summed is not large: 3776 
> doubles.
> 
> 
> #0  0x0000003160a32495 in raise (sig=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x0000003160a33bfd in abort () at abort.c:121
> #2  0x0000000002a3903e in for__issue_diagnostic ()
> #3  0x0000000002a3ff66 in for__signal_handler ()
> #4  <signal handler called>
> #5  0x00002b67a4217029 in mca_btl_vader_check_fboxes () at 
> btl_vader_fbox.h:208
> #6  0x00002b67a421962e in mca_btl_vader_component_progress () at 
> btl_vader_component.c:724
> #7  0x00002b67934fd311 in opal_progress () at runtime/opal_progress.c:229
> #8  0x00002b6792e2f0df in ompi_request_wait_completion (req=0xe863600) at 
> ../ompi/request/request.h:415
> #9  0x00002b6792e2f122 in ompi_request_default_wait (req_ptr=0x7ffebdbb8c20, 
> status=0x0) at request/req_wait.c:42
> #10 0x00002b6792ed7d5a in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, 
> rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, 
> comm=0xe14c9c0, module=0xe14f8b0)
>     at base/coll_base_allreduce.c:460
> #11 0x00002b67a6ccb3e2 in ompi_coll_tuned_allreduce_intra_dec_fixed 
> (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, 
> op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0)
>     at coll_tuned_decision_fixed.c:74
> #12 0x00002b6792e4d9b0 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0xeb79ca0, 
> count=3776, datatype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0) at 
> pallreduce.c:113
> #13 0x00002b6792bb6287 in ompi_allreduce_f (sendbuf=0x1 <Address 0x1 out of 
> bounds>,
>     recvbuf=0xeb79ca0 
> "\310,&AYI\257\276\031\372\214\223\270-y>\207\066\226\003W\f\240\276\334'}\225\376\336\277>\227§\231",
>  count=0x7ffebdbbc4d4, datatype=0x2b48f5c, op=0x2b48f60,
>     comm=0x5a0ae60, ierr=0x7ffebdbb8f60) at pallreduce_f.c:87
> #14 0x000000000042991b in m_sumb_d (comm=..., vec=..., n=Cannot access memory 
> at address 0x928
> ) at mpi.F:870
> #15 m_sum_d (comm=..., vec=..., n=Cannot access memory at address 0x928
> ) at mpi.F:3184
> #16 0x0000000001b22b83 in david::eddav (hamiltonian=..., p=Cannot access 
> memory at address 0x1
> ) at davidson.F:779
> #17 0x0000000001c6ef0e in elmin (hamiltonian=..., kineden=Cannot access 
> memory at address 0x19
> ) at electron.F:424
> #18 0x0000000002a108b2 in electronic_optimization () at main.F:4783
> #19 0x00000000029ec5d3 in vamp () at main.F:2800
> #20 0x00000000004100de in main ()
> #21 0x0000003160a1ed1d in __libc_start_main (main=0x4100b0 <main>, argc=1, 
> ubp_av=0x7ffebdbc5e38, init=<value optimized out>, fini=<value optimized 
> out>, rtld_fini=<value optimized out>,
>     stack_end=0x7ffebdbc5e28) at libc-start.c:226
> #22 0x000000000040ffe9 in _start ()
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
