Hi,

> On 05.04.2018, at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange
> way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total),
> and despite the fact that the communication pattern is rather regular, one
> particular send/recv pair hangs consistently. Basically, across each row of
> 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16
> such sets all those send/recv pairs complete. However, on 2 of them, it
> hangs (both the send and recv). I have stack traces (with gdb -p on the
> running processes) from what I believe are corresponding send/recv pairs.
>
> <snip>
>
> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions),
> Intel compilers (17.2.174). It seems to be independent of which nodes, always
> happens on this pair of calls and happens after the code has been running for
> a while, and the same code for the other 14 sets of 4 work fine, suggesting
> that it’s an MPI issue, rather than an obvious bug in this code or a hardware
> problem. Does anyone have any ideas, either about possible causes or how to
> debug things further?
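Just to make sure I read the pattern correctly: within each row of 4 in the 4x16 grid, ranks 1-3 send to rank 0 of that row. A stripped-down sketch of that pattern in C is below (my own reconstruction, with made-up buffer, tag and data type, not the actual VASP code; run on exactly 64 ranks):

  /* Sketch of the reported pattern: 4x16 Cartesian grid; within each
   * row of 4 ranks, ranks 1-3 send one value to rank 0.  Buffer contents,
   * tag and data type are my own assumptions. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int wsize;
      MPI_Comm_size(MPI_COMM_WORLD, &wsize);
      if (wsize != 64) {          /* the grid below needs exactly 64 ranks */
          MPI_Finalize();
          return 1;
      }

      int dims[2] = {4, 16}, periods[2] = {0, 0};
      MPI_Comm cart, row;
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

      /* keep the dimension of size 4 -> 16 sub-communicators of 4 ranks each */
      int remain[2] = {1, 0};
      MPI_Cart_sub(cart, remain, &row);

      int rrank;
      MPI_Comm_rank(row, &rrank);

      double val = (double)rrank;
      if (rrank == 0) {
          for (int src = 1; src < 4; src++)
              MPI_Recv(&val, 1, MPI_DOUBLE, src, 0, row, MPI_STATUS_IGNORE);
      } else {
          MPI_Send(&val, 1, MPI_DOUBLE, 0, 0, row);
      }

      MPI_Finalize();
      return 0;
  }

If an isolated reproducer like this never hangs, that would point at something higher up the stack rather than at the point-to-point layer itself.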
Do you use ScaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found that a self-compiled ScaLAPACK on top of it works fine in combination with Open MPI. Using Intel's ScaLAPACK with Intel MPI also works fine. What I never got working was the combination of Intel's ScaLAPACK and Open MPI: at one point a process received a message from the wrong rank, IIRC. I tried both the Intel-supplied Open MPI version of ScaLAPACK and compiling the necessary interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with identical results.

-- Reuti

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users