Hi OpenMPI users - I’m trying to debug a non-deterministic crash, apparently in 
opal_progress, with OpenMPI 3.1.0.  All of the crashes seem to involve 
mpi_allreduce, although the particular call site in the code (VASP) varies, 
and they seem more frequent at larger core/MPI task counts: with 128 tasks a 
crash happens within a few minutes (5-200 iterations of the code), while 
16-core runs go for thousands of iterations without the problem.  The tail 
end of the stack trace looks like
libopen-pal.so.40  00002AC204D2B890  opal_progress         Unknown  Unknown
libmpi.so.40.10.0  00002AC2047E6CEA  ompi_coll_base_se     Unknown  Unknown
libmpi.so.40.10.0  00002AC2047E81B3  ompi_coll_base_al     Unknown  Unknown
libmpi.so.40.10.0  00002AC20479EF7F  PMPI_Allreduce        Unknown  Unknown
libmpi_mpifh.so.4  00002AC2045301D7  mpi_allreduce_        Unknown  Unknown
or 
libopen-pal.so.40  00002AD2F1B94890  opal_progress         Unknown  Unknown
libmpi.so.40.10.0  00002AD2F15F678D  ompi_request_defa     Unknown  Unknown
libmpi.so.40.10.0  00002AD2F164FD00  ompi_coll_base_se     Unknown  Unknown
libmpi.so.40.10.0  00002AD2F16511B3  ompi_coll_base_al     Unknown  Unknown
libmpi.so.40.10.0  00002AD2F1607F7F  PMPI_Allreduce        Unknown  Unknown
libmpi_mpifh.so.4  00002AD2F13991D7  mpi_allreduce_        Unknown  Unknown

What useful steps can I take to debug this?  Recompile with --enable-debug?  
Are there any other versions that are worth trying?  I don’t recall this error 
happening before we switched to 3.1.0.
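
For reference, the debug rebuild I have in mind would look roughly like the
sketch below. The install prefix, the source directory name, the -j value,
and the vasp_std executable name are placeholders assumed for illustration,
not details of the failing setup; only the --enable-debug configure option
and the 128-task count come from the description above.

```shell
# Sketch only: rebuild Open MPI 3.1.0 with debugging support into a
# separate prefix. PREFIX is a hypothetical path, not the real install.
PREFIX="$HOME/openmpi-3.1.0-debug"

cd openmpi-3.1.0            # assumed name of the unpacked source tree
./configure --prefix="$PREFIX" --enable-debug
make -j 8 all
make install

# Put the debug build first on the search paths before relinking or rerunning.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"

# Allow core dumps so a crash leaves something to inspect in a debugger.
ulimit -c unlimited

# Rerun at the task count that fails quickly; vasp_std is a placeholder
# for whatever the VASP executable is called locally.
mpirun -np 128 ./vasp_std
```

Since the crash is intermittent, it may take more than one run at this task
count before a core file shows up.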

thanks,
Noam
