Hi Andrés,

Did you recompile OpenMPI after updating to SLURM 19.05?
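In case it is useful, a rough way to check, and what a rebuild typically looks like. The install paths below are only placeholders; adjust them to your site:

  # Which SLURM/PMI support this OpenMPI build reports
  ompi_info | grep -Ei 'slurm|pmi'

  # Which libslurm/libpmi the MCA plugins actually resolve to (path is a placeholder)
  ldd /opt/openmpi/lib/openmpi/mca_*.so 2>/dev/null | grep -Ei 'libslurm|libpmi'

  # If they still resolve to the 18.08 libraries, a rebuild against the 19.05 tree
  # is the usual fix, roughly:
  ./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/opt/slurm-19.05
  make -j && make install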
Sean

--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne

On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <ama...@cesvima.upm.es> wrote:

Thank you very much for the help; here is some updated information.

- If we use Intel MPI (IMPI) mpirun, it works correctly.
- If we use mpirun without going through the scheduler, it works correctly.
- If we use srun with software compiled against OpenMPI, it works correctly.
- If we use SLURM 18.08.6, it works correctly.
- If we use SLURM 19.05.0 and mpirun inside the sbatch script, we get the error below (a minimal batch-script sketch that reproduces this is appended at the end of this message):

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------

Trying to trace the problem:
- mpirun is a binary, so it cannot be traced with bash -x.
- I ran "strace mpirun hostname" to see if it helps, but I am not able to see where the problem may be. Here is the output from strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW

And here is the slurmd log at verbosity level 5:

Main node (slurmd log):
2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job

Secondary node (slurmd log):
2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job

Thank you very much again.

--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
ama...@cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
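P.S. The minimal batch-script sketch referenced above. The resource values and module line are illustrative, not our exact production script:

  #!/bin/bash
  #SBATCH --job-name=ompi-test
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --mem=1024M

  module load openmpi        # site-specific; adjust to your environment

  # Fails under SLURM 19.05.0 with the ORTE error quoted above:
  mpirun hostname

  # The same binary launched with srun works:
  # srun hostname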