Hi OpenMPI users - I’m trying to debug a non-deterministic crash, apparently in
opal_progress, with OpenMPI 3.1.0. All of the crashes seem to involve
mpi_allreduce, although the particular call site in the code (VASP) varies, and
they seem more frequent at larger core/MPI task counts (with 128 tasks the
crash happens within a few minutes, after 5-200 iterations of the code, while
16-core runs go thousands of iterations without hitting it). The tail end of
the stack trace looks like
libopen-pal.so.40 00002AC204D2B890 opal_progress Unknown Unknown
libmpi.so.40.10.0 00002AC2047E6CEA ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.0 00002AC2047E81B3 ompi_coll_base_al Unknown Unknown
libmpi.so.40.10.0 00002AC20479EF7F PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AC2045301D7 mpi_allreduce_ Unknown Unknown
or
libopen-pal.so.40 00002AD2F1B94890 opal_progress Unknown Unknown
libmpi.so.40.10.0 00002AD2F15F678D ompi_request_defa Unknown Unknown
libmpi.so.40.10.0 00002AD2F164FD00 ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.0 00002AD2F16511B3 ompi_coll_base_al Unknown Unknown
libmpi.so.40.10.0 00002AD2F1607F7F PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AD2F13991D7 mpi_allreduce_ Unknown Unknown
What useful steps can I take to debug this? Recompile with --enable-debug? Are
there any other versions worth trying? I don’t recall this error happening
before we switched to 3.1.0.
thanks,
Noam