Hi all - I've been trying to debug a segfault in OpenMPI 3.1.2, and in the
process I noticed that 3.1.3 is out, so I thought I'd test it. However, with
3.1.3 the code (LAMMPS) hangs very early, while processing its input. I'm running
16 tasks on a single 16-core node, with InfiniBand (which it may be using,
although it's only one node). Attaching to the 16 hung processes with gdb, it
appears that 15 of them are in PMPI_Cart_create (input.cpp line 243), while one
is stuck on an earlier Bcast (input.cpp line 222), which is presumably the
task that is actually holding everything up. Its backtrace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002b7303ac7f6f in opal_progress () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libopen-pal.so.40
#0 0x00002b7303ac7f6f in opal_progress () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libopen-pal.so.40
#1 0x00002b73022e0675 in ompi_request_default_wait () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libmpi.so.40
#2 0x00002b73023286ee in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libmpi.so.40
#3 0x00002b7302328b67 in ompi_coll_base_bcast_intra_binomial () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libmpi.so.40
#4 0x00002b731881670c in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/openmpi/mca_coll_tuned.so
#5 0x00002b73022f5b19 in PMPI_Bcast () from /share/apps/mpi/openmpi/3.1.3/ib/gnu/lib/libmpi.so.40
#6 0x0000000000483fdb in LAMMPS_NS::Input::file (this=0x3a62eb0) at ../input.cpp:222
#7 0x000000000040bc48 in main (argc=<optimized out>, argv=<optimized out>) at ../main.cpp:54
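For comparing the stacks of all 16 processes at a glance, a small helper along these lines may be useful. It is only a sketch and was not part of the debugging described here: the function names outermost_mpi_call and summarize are made up for illustration, and it assumes one saved gdb backtrace per process, in the same textual format as the one above.

```python
import re
from collections import Counter

# Matches the start of a gdb frame line, e.g.
#   "#5 0x00002b73022f5b19 in PMPI_Bcast () from ..."
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")

# Public MPI entry points appear as MPI_* or PMPI_* in a backtrace.
MPI_CALL_RE = re.compile(r"P?MPI_")


def outermost_mpi_call(backtrace):
    """Return the MPI entry point nearest the application code, or None."""
    found = None
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and MPI_CALL_RE.match(m.group(2)):
            # Higher frame numbers are closer to main(), so keep the last match.
            found = m.group(2)
    return found


def summarize(backtraces):
    """Count how many processes are blocked in each MPI call."""
    return Counter(outermost_mpi_call(bt) for bt in backtraces)
```

Feeding it the 16 backtraces (for example, saved one per file from something like gdb -p PID -batch -ex bt) should show at a glance which collective each rank is sitting in, without reading every stack by hand.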
The compilation is pretty straightforward: CentOS 7 stock gcc (4.8.5), CentOS
stock IB support, and the "--with-verbs --with-ofi" flags to configure. My first
thought was to add --enable-debug and --enable-mem-debug, but with those
configure options on, the process does not hang.
Does anyone have any suggestions for investigating further?
thanks,
Noam
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users