> On Apr 5, 2018, at 11:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> 
> Hi,
> 
>> Am 05.04.2018 um 16:16 schrieb Noam Bernstein <noam.bernst...@nrl.navy.mil>:
>> 
>> Hi all - I have a code that uses MPI (VASP), and it’s hanging in a strange 
>> way.  There’s a Cartesian communicator, 4x16 (64 processes total), and 
>> despite the fact that the communication pattern is rather regular, one 
>> particular send/recv pair hangs consistently.  Across each row of 4, task 0 
>> receives from 1, 2, 3, and tasks 1, 2, 3 send to 0.  On most of the 16 such 
>> sets all those send/recv pairs complete; however, on 2 of them it hangs 
>> (both the send and the recv).  I have stack traces (taken with gdb -p on 
>> the running processes) from what I believe are the corresponding send/recv 
>> pairs.  
>> 
>> <snip>
>> 
>> This is with Open MPI 3.0.1 (same with 3.0.0; I haven’t checked older 
>> versions) and Intel compilers (17.2.174).  It seems to be independent of 
>> which nodes are used, always happens on this pair of calls, and happens 
>> only after the code has been running for a while.  The same code works fine 
>> for the other 14 sets of 4, suggesting that it’s an MPI issue rather than 
>> an obvious bug in this code or a hardware problem.  Does anyone have any 
>> ideas, either about possible causes or how to debug this further?
> 
> Do you use ScaLAPACK, and which type of BLAS/LAPACK?  I used Intel MKL with 
> the Intel compilers for VASP and found that a self-compiled ScaLAPACK works 
> fine in combination with Open MPI.  Intel ScaLAPACK with Intel MPI also 
> works fine.  What I never got working was the combination of Intel 
> ScaLAPACK and Open MPI - at one point one process got a message from the 
> wrong rank, IIRC.  I tried both the Intel-supplied Open MPI version of 
> ScaLAPACK and compiling the necessary interface myself for Open MPI in 
> $MKLROOT/interfaces/mklmpi, with identical results.

MKL BLAS/LAPACK, with my own self-compiled ScaLAPACK, but in this run I set 
LSCALAPACK=.FALSE.  I suppose I could try compiling without it, just to test.  
In any case, this happens when it’s writing out the wavefunctions, which I 
would assume is unrelated to ScaLAPACK operations (unless they’re corrupting 
some low-level MPI state, I guess).
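To make the pattern from my original message concrete, here’s a small 
plain-Python sketch (not VASP’s actual code; the row-major rank-to-grid 
mapping is an assumption about how the Cartesian communicator is laid out) of 
exactly which send/recv pairs I mean:

```python
# Hypothetical sketch of the communication pattern: 64 ranks on a 16x4
# Cartesian grid, where within each row of 4 the ranks at columns 1..3
# send to the rank at column 0 (which posts the matching receives).
ROWS, COLS = 16, 4  # 16 sets ("rows"), each 4 ranks wide

def coords(rank):
    # Row-major rank -> (row, col) mapping (an assumption here).
    return rank // COLS, rank % COLS

def send_recv_pairs():
    # (sender, receiver) pairs: columns 1..3 of each row send to column 0.
    pairs = []
    for row in range(ROWS):
        root = row * COLS  # rank at column 0 of this row
        for col in range(1, COLS):
            pairs.append((row * COLS + col, root))
    return pairs

print(len(send_recv_pairs()))  # 48 point-to-point pairs in total
```

In the hang I’m seeing, 14 of the 16 rows complete all three of their pairs, 
and in 2 rows one pair blocks on both ends.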

	Noam

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
