I would do the normal things: log into those nodes, run dmesg, and look at /var/log/messages. Also look at the Slurm log on each node for messages around the time the job ended.
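For example, something along these lines on each node the job ran on (compute-0-0 and compute-0-1 show up in the ORTE message). The log paths below are the usual CentOS/Rocks defaults and <jobid> is just a placeholder, so adjust them to your setup and slurm.conf:

    ssh compute-0-1

    # kernel ring buffer: OOM kills, hardware or NIC errors around the failure
    dmesg -T | grep -Ei 'oom|killed process|error|link'

    # system log (default path on CentOS/Rocks)
    grep -Ei 'oom|slurmd|kernel' /var/log/messages | less

    # slurmd log on the node; the path is whatever SlurmdLogFile points to
    grep -i '<jobid>' /var/log/slurm/slurmd.log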
Also look at the sysstat files and see whether a lot of memory was in use at the time (http://sebastien.godard.pagesperso-orange.fr/); a sar example follows the quoted message below.

On Wed, 17 Apr 2019 at 09:16, Mahmood Naderan <mahmood...@gmail.com> wrote:
> Hi,
> A QuantumEspresso, multinode and multiprocess MPI job has been terminated
> with the following messages in the log file
>
>      total cpu time spent up to now is    63540.4 secs
>
>      total energy              = -14004.61932175 Ry
>      Harris-Foulkes estimate   = -14004.73511665 Ry
>      estimated scf accuracy    <     0.84597958 Ry
>
>      iteration #  7     ecut=    48.95 Ry     beta= 0.70
>      Davidson diagonalization with overlap
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[7952,0],0] on node compute-0-0
>   Remote daemon: [[7952,0],1] on node compute-0-1
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> The slurm script for that is
>
> #!/bin/bash
> #SBATCH --job-name=myQE
> #SBATCH --output=mos2.rlx.out
> #SBATCH --ntasks=14
> #SBATCH --mem-per-cpu=17G
> #SBATCH --nodes=6
> #SBATCH --partition=QUARTZ
> #SBATCH --account=z5
> mpirun pw.x -i mos2.rlx.in
>
> The job is running on Slurm 18.08 and Rocks 7, whose default OpenMPI is
> 2.1.1.
>
> Other jobs with OMPI, Slurm and QE are fine. So I want to know how I can
> narrow my search to find the root cause of this specific problem. For
> example, I don't know whether the QE calculation had diverged or not. Is
> there any way to find more information about that?
>
> Any idea?
>
> Regards,
> Mahmood
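Something like this, assuming the stock sysstat setup that keeps daily binary files under /var/log/sa (saDD is the day of the month; sa17 below is just an example matching the date of this thread, so pick the day the job actually died):

    # memory utilization over the whole day
    sar -r -f /var/log/sa/sa17

    # narrow to the window when the job was killed
    sar -r -f /var/log/sa/sa17 -s 08:00:00 -e 10:00:00

    # swap usage is another sign of memory pressure
    sar -S -f /var/log/sa/sa17

If one of the nodes was running out of memory, dmesg on that node should also show the OOM killer firing around the same time.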