> On Jul 16, 2018, at 8:34 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
>>
>> Please give master a try. This looks like another signature of running out
>> of space for shared memory buffers.
>
> Sorry, I wasn’t explicit on this point - I’m already using master,
> specifically
> openmpi-master-201807120327-34bc777.tar.gz

And a bit more data on the stack traces, since the problem is non-deterministic. I’ve run 30 sets of 10 iterations of the code, and 8 crashed. In every case the final part of the stack trace was

Program terminated with signal 6, Aborted.
#0  0x0000003f5a432495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
#0  0x0000003f5a432495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003f5a433bfd in abort () at abort.c:121
#2  0x0000000002a3985e in for__issue_diagnostic ()
#3  0x0000000002a40786 in for__signal_handler ()
#4  <signal handler called>
#5  0x00002ae37088f029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
#6  0x00002ae37089162e in mca_btl_vader_component_progress () at btl_vader_component.c:724
#7  0x00002ae35fa41311 in opal_progress () at runtime/opal_progress.c:229
#8  0x00002ae3724a11b7 in ompi_request_wait_completion (req=0xd2a4700) at ../../../../ompi/request/request.h:415

with some variation in the routines that lead to this point. In all cases the MPI call was some all-to-all routine: all but one were "ompi_allreduce_f", and one was "ompi_alltoallv_z". I can of course post all 8 stack traces if that’s useful.

Noam
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users