> On Apr 5, 2018, at 11:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
> Hi,
>
>> On 05.04.2018 at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>>
>> Hi all - I have a code that uses MPI (VASP), and it's hanging in a strange
>> way. Basically, there's a Cartesian communicator, 4x16 (64 processes
>> total), and despite the fact that the communication pattern is rather
>> regular, one particular send/recv pair hangs consistently. Across each
>> row of 4, task 0 receives from tasks 1, 2, and 3, and tasks 1, 2, and 3
>> send to task 0. On most of the 16 such sets all those send/recv pairs
>> complete. However, on 2 of them it hangs (both the send and the recv). I
>> have stack traces (from gdb -p on the running processes) from what I
>> believe are the corresponding send/recv pairs.
>>
>> <snip>
>>
>> This is with Open MPI 3.0.1 (same for 3.0.0; I haven't checked older
>> versions) and Intel compilers (17.2.174). It seems to be independent of
>> which nodes are used, it always happens on this pair of calls, and it
>> happens only after the code has been running for a while. The same code
>> for the other 14 sets of 4 works fine, suggesting that it's an MPI issue
>> rather than an obvious bug in this code or a hardware problem. Does
>> anyone have any ideas, either about possible causes or how to debug
>> things further?
>
> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with
> the Intel compilers for VASP and found that using a self-compiled
> scaLAPACK in addition works fine in combination with Open MPI. Using
> Intel scaLAPACK and Intel MPI also works fine. What I never got working
> was the combination of Intel scaLAPACK and Open MPI: at one point one
> process got a message from a wrong rank, IIRC. I tried both the
> Intel-supplied Open MPI version of scaLAPACK and compiling the necessary
> interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with
> identical results.
MKL BLAS/LAPACK, with my own self-compiled scaLAPACK, but in this run I set
LSCALAPACK = .FALSE. I suppose I could try compiling without it just to test.
In any case, the hang happens while the code is writing out the wavefunctions,
which I would assume to be unrelated to scaLAPACK operations (unless they're
corrupting some low-level MPI state, I guess).

Noam
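For reference, the hanging exchange described at the top of the thread reduces
to something like the following minimal C sketch. It assumes the 4x16 grid is
split into rows of 4 via MPI_Cart_sub; the grid orientation, buffer size, tag,
and datatype are illustrative guesses, not taken from VASP's actual source.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* 16 rows of 4 ranks each (64 total). The orientation of the
     * 4x16 grid is an assumption, not taken from VASP. */
    int dims[2] = {16, 4}, periods[2] = {0, 0};
    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);

    /* Carve out row sub-communicators: keep only the second dimension,
     * giving 16 communicators of 4 ranks each. */
    int remain_dims[2] = {0, 1};
    MPI_Comm row_comm;
    MPI_Cart_sub(cart_comm, remain_dims, &row_comm);

    int row_rank;
    MPI_Comm_rank(row_comm, &row_rank);

    double buf[1024] = {0};
    if (row_rank == 0) {
        /* Task 0 of each row receives one message from tasks 1, 2, 3 */
        for (int src = 1; src < 4; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, 0, row_comm,
                     MPI_STATUS_IGNORE);
    } else {
        /* Tasks 1, 2, 3 each send one message to task 0 of their row */
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, 0, row_comm);
    }

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}

Run with mpirun -np 64. In the report above, the equivalent exchange completes
for 14 of the 16 rows and hangs on both the send and recv sides for the other
2, even though every row executes identical code.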