> On Apr 5, 2018, at 11:03 AM, Reuti <[email protected]> wrote:
>
> Hi,
>
>> On 05.04.2018, at 16:16, Noam Bernstein <[email protected]> wrote:
>>
>> Hi all - I have a code that uses MPI (VASP), and it’s hanging in a strange
>> way. There’s a Cartesian communicator, 4x16 (64 processes total), and
>> despite the fact that the communication pattern is rather regular, one
>> particular send/recv pair hangs consistently. Specifically, across each
>> row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. In most
>> of the 16 such sets all those send/recv pairs complete. However, on 2 of
>> them it hangs (both the send and the recv). I have stack traces (taken
>> with gdb -p on the running processes) from what I believe are the
>> corresponding send/recv pairs.
>>
>> <snip>
>>
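>> For concreteness, the communication pattern above is roughly equivalent
>> to the following standalone sketch (the buffer size, the message tag, and
>> which grid dimension forms the “row of 4” are illustrative guesses on my
>> part, not taken from the actual VASP code):
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       MPI_Init(&argc, &argv);
>>
>>       int world_rank, world_size;
>>       MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>>       /* the pattern assumes 4x16 = 64 ranks */
>>       if (world_size != 64) {
>>           if (world_rank == 0) fprintf(stderr, "run with 64 ranks\n");
>>           MPI_Abort(MPI_COMM_WORLD, 1);
>>       }
>>
>>       /* 4x16 Cartesian grid, no periodicity, no reordering */
>>       int dims[2] = {4, 16}, periods[2] = {0, 0};
>>       MPI_Comm cart;
>>       MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
>>
>>       /* split into 16 sub-communicators of 4 ranks each */
>>       int remain[2] = {1, 0};
>>       MPI_Comm row;
>>       MPI_Cart_sub(cart, remain, &row);
>>
>>       int row_rank;
>>       MPI_Comm_rank(row, &row_rank);
>>
>>       /* within each set of 4, ranks 1,2,3 send to rank 0 */
>>       double buf[1024] = {0.0};
>>       if (row_rank == 0) {
>>           for (int src = 1; src < 4; src++)
>>               MPI_Recv(buf, 1024, MPI_DOUBLE, src, 0, row,
>>                        MPI_STATUS_IGNORE);
>>       } else {
>>           MPI_Send(buf, 1024, MPI_DOUBLE, 0, 0, row);
>>       }
>>
>>       MPI_Comm_free(&row);
>>       MPI_Comm_free(&cart);
>>       MPI_Finalize();
>>       return 0;
>>   }
>>
>> (In the actual run it is this recv/send pair that hangs on 2 of the 16
>> groups of 4.)
>>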
>> This is with Open MPI 3.0.1 (same for 3.0.0; I haven’t checked older
>> versions) and Intel compilers (17.2.174). It seems to be independent of
>> which nodes are used, it always happens on this pair of calls and only
>> after the code has been running for a while, and the same code works fine
>> for the other 14 sets of 4, all of which suggests an MPI issue rather
>> than an obvious bug in this code or a hardware problem. Does anyone have
>> any ideas, either about possible causes or how to debug this further?
>
> Do you use ScaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with
> the Intel compilers for VASP and found that additionally using a
> self-compiled ScaLAPACK works fine in combination with Open MPI. Using
> Intel ScaLAPACK and Intel MPI also works fine. What I never got working
> was the combination of Intel ScaLAPACK and Open MPI: at one point one
> process got a message from the wrong rank, IIRC. I tried both the
> Intel-supplied Open MPI version of ScaLAPACK and compiling the necessary
> interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with
> identical results.

MKL BLAS/LAPACK with my own self-compiled ScaLAPACK, but in this run I set
LSCALAPACK=.FALSE. I suppose I could try compiling without it, just to test.
In any case, the hang happens when it’s writing out the wavefunctions, which
I would assume to be unrelated to ScaLAPACK operations (unless those are
corrupting some low-level MPI state, I guess).
Noam