> On Jul 16, 2018, at 8:34 AM, Noam Bernstein <[email protected]> wrote:
>
>> On Jul 14, 2018, at 1:31 AM, Nathan Hjelm via users <[email protected]> wrote:
>>
>> Please give master a try. This looks like another signature of running out
>> of space for shared memory buffers.
>
> Sorry, I wasn't explicit on this point - I'm already using master,
> specifically
> openmpi-master-201807120327-34bc777.tar.gz
And a bit more data on the stack traces, since the problem is
non-deterministic. I've run 30 sets of 10 iterations of the code, and 8 of
the sets crashed. In every case the final part of the stack trace was:
Program terminated with signal 6, Aborted.
#0 0x0000003f5a432495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
#0 0x0000003f5a432495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003f5a433bfd in abort () at abort.c:121
#2 0x0000000002a3985e in for__issue_diagnostic ()
#3 0x0000000002a40786 in for__signal_handler ()
#4 <signal handler called>
#5 0x00002ae37088f029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
#6 0x00002ae37089162e in mca_btl_vader_component_progress () at
btl_vader_component.c:724
#7 0x00002ae35fa41311 in opal_progress () at runtime/opal_progress.c:229
#8 0x00002ae3724a11b7 in ompi_request_wait_completion (req=0xd2a4700) at
../../../../ompi/request/request.h:415
with some variation in the routines leading to this point. In all cases the
MPI call was a collective: all but one of the traces ended in
"ompi_allreduce_f", and one in "ompi_alltoallv_z".
I can of course post all 8 stack traces if that’s useful.
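
Since the earlier reply suggested this looks like running out of space for
shared memory buffers, one quick thing to check (a sketch, assuming Linux
nodes where the vader BTL places its backing files under /dev/shm, which may
not match your configuration) is the state of the shared-memory filesystem
on the compute nodes, both during a run and after a crash:

```shell
# Show capacity and current usage of the tmpfs that Open MPI's
# vader BTL typically uses for its shared-memory backing files.
df -h /dev/shm

# List files under /dev/shm; stale segments left behind by crashed
# runs can eat into the space available to subsequent jobs.
ls -l /dev/shm
```

If /dev/shm is close to full, or leftover files from aborted runs are
accumulating there, that would be consistent with the exhaustion signature
Nathan described.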
Noam
_______________________________________________
users mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/users
