I would do the normal things: log into those nodes, run dmesg, and check
/var/log/messages. Also check the slurmd log on each node and look for
messages around the time the job ended.
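
Something like this is a reasonable first pass (a sketch only: the slurmd
log path and JOBID are placeholders, "scontrol show config | grep
SlurmdLogFile" will tell you where the log actually lives, and sacct only
reports MaxRSS if accounting is enabled):

    # on the node: look for OOM-killer or hardware errors near the failure
    dmesg -T | grep -iE 'oom|out of memory|hardware error'
    grep -iE 'oom|out of memory' /var/log/messages

    # slurmd messages around the job ending (JOBID = the Slurm job id)
    grep JOBID /var/log/slurmd.log

    # peak memory and exit state as recorded by Slurm accounting
    sacct -j JOBID --format=JobID,MaxRSS,NodeList,State,ExitCode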

Also look at the sysstat files to see whether a lot of memory was being
used around that time: http://sebastien.godard.pagesperso-orange.fr/
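
For example (assuming the default RHEL/CentOS sysstat layout, where saDD
is the day of the month, so sa17 for April 17; on some distros the
directory is /var/log/sysstat instead):

    # memory utilization over the day the job died
    sar -r -f /var/log/sa/sa17

    # swap activity, which often spikes just before an OOM kill
    sar -W -f /var/log/sa/sa17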

On Wed, 17 Apr 2019 at 09:16, Mahmood Naderan <mahmood...@gmail.com> wrote:

> Hi,
> A Quantum ESPRESSO multi-node, multi-process MPI job was terminated
> with the following messages in the log file:
>
>
>      total cpu time spent up to now is    63540.4 secs
>
>      total energy              =  -14004.61932175 Ry
>      Harris-Foulkes estimate   =  -14004.73511665 Ry
>      estimated scf accuracy    <       0.84597958 Ry
>
>      iteration #  7     ecut=    48.95 Ry     beta= 0.70
>      Davidson diagonalization with overlap
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[7952,0],0] on node compute-0-0
>   Remote daemon: [[7952,0],1] on node compute-0-1
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
>
>
>
> The Slurm script for that job is:
>
> #!/bin/bash
> #SBATCH --job-name=myQE
> #SBATCH --output=mos2.rlx.out
> #SBATCH --ntasks=14
> #SBATCH --mem-per-cpu=17G
> #SBATCH --nodes=6
> #SBATCH --partition=QUARTZ
> #SBATCH --account=z5
> mpirun pw.x -i mos2.rlx.in
>
>
> The job was running under Slurm 18.08 on Rocks 7, whose default
> OpenMPI is 2.1.1.
>
> Other jobs with OpenMPI, Slurm, and QE are fine, so I want to know how
> I can narrow my search to find the root cause of this specific problem.
> For example, I don't know whether the QE calculation diverged or not.
> Is there any way to find out more about that?
>
> Any idea?
>
> Regards,
> Mahmood