Hi,

> On 05.04.2018 at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
> 
> Hi all - I have a code that uses MPI (VASP), and it's hanging in a strange 
> way. There's a 4x16 Cartesian communicator (64 processes total), and despite 
> the fact that the communication pattern is rather regular, one particular 
> send/recv pair hangs consistently. Across each row of 4, task 0 receives 
> from tasks 1, 2, and 3, and tasks 1, 2, and 3 send to task 0. On most of 
> the 16 such rows all of those send/recv pairs complete, but on 2 of them 
> both the send and the recv hang. I have stack traces (taken with gdb -p on 
> the running processes) from what I believe are the corresponding send/recv 
> pairs.
> 
> <snip>
> 
> This is with Open MPI 3.0.1 (the same happens with 3.0.0; I haven't checked 
> older versions) and the Intel compilers (17.2.174). It seems to be 
> independent of which nodes are used, it always happens on this pair of 
> calls, and only after the code has been running for a while, while the same 
> code works fine for the other 14 rows of 4. That suggests an MPI issue 
> rather than an obvious bug in this code or a hardware problem. Does anyone 
> have any ideas, either about possible causes or how to debug this further?
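
The per-row pattern described in the quoted message can be sketched roughly as 
follows. This is a minimal, self-contained illustration in C, not VASP's actual 
code; the dimension ordering, message tag, and buffer contents are assumptions 
made only for the example.

/*
 * Minimal sketch (not VASP source): a 4x16 Cartesian grid split into 16 row
 * communicators of 4 ranks, where row-rank 0 receives one message from each
 * of row-ranks 1-3. Tag and buffer contents are arbitrary example choices.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2]    = {16, 4};      /* 16 rows x 4 columns = 64 ranks */
    int periods[2] = {0, 0};
    MPI_Comm cart, row;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int remain[2] = {0, 1};        /* keep the 2nd dimension: rows of 4 */
    MPI_Cart_sub(cart, remain, &row);

    int rrank;
    MPI_Comm_rank(row, &rrank);

    double val = (double)rrank, rbuf;
    if (rrank == 0) {
        for (int src = 1; src < 4; src++)          /* 0 receives from 1,2,3 */
            MPI_Recv(&rbuf, 1, MPI_DOUBLE, src, 0, row, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(&val, 1, MPI_DOUBLE, 0, 0, row);  /* 1,2,3 send to 0 */
    }

    MPI_Comm_free(&row);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

Run with 64 ranks (e.g. mpirun -np 64 ./a.out); each row sub-communicator then 
performs the three send/recv pairs, and in the situation quoted above two of 
the 16 rows stall in exactly these calls.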

Do you use ScaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the 
Intel compilers for VASP and found that additionally using a self-compiled 
ScaLAPACK works fine in combination with Open MPI. Using Intel's ScaLAPACK 
with Intel MPI also works fine. What I never got working was the combination 
of Intel's ScaLAPACK with Open MPI: at one point a process received a message 
from the wrong rank, IIRC. I tried both the Intel-supplied Open MPI version of 
ScaLAPACK and compiling the necessary interface for Open MPI myself in 
$MKLROOT/interfaces/mklmpi, with identical results.
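
A quick, generic sanity check in such mixed ScaLAPACK/BLACS/MPI setups (just a 
suggestion, not anything VASP-specific) is to print the MPI library version 
string at startup, which confirms which MPI implementation the executable 
actually resolved at run time:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("MPI library: %s\n", version);  /* e.g. "Open MPI v3.0.1, ..." */

    MPI_Finalize();
    return 0;
}

MPI_Get_library_version is an MPI-3 call, so it is available in Open MPI 3.x 
and in recent Intel MPI versions.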

-- Reuti
