Dear everyone,

I have a calculation (the CP2K program) running with MPI over InfiniBand, and it is stuck. All 16 processes (on 4 nodes) are still running, each taking 100% CPU. Attaching a debugger shows the following (only the innermost frames of the stack are shown here):

(gdb) backtrace
#0  0x00002b3460916dbf in btl_openib_component_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1  0x00002b345c22c778 in opal_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2  0x00002b345bd2d66d in ompi_request_default_wait_any () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3  0x00002b345bd6021a in PMPI_Waitany () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4  0x00002b345bae77f1 in pmpi_waitany__ () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0

The stuck job has survived a restart of the IB switch, unlike "healthy" runs.

My question is: is it obvious at what level the problem lies? IB, Open MPI, or the application?

I would be glad to provide detailed information if anyone is willing to help. I want to work on this, but unfortunately I am not sure where to begin.

Best regards,
Ondrej Marsalek