Thank you very much for the help. Here is some updated information:

- If we use Intel MPI (IMPI) mpirun, it works correctly.
- If we use mpirun without going through the scheduler, it works correctly.
- If we use srun with software compiled against OpenMPI, it works correctly.
- If we use SLURM 18.08.6, it works correctly.
- If we use SLURM 19.05.0 and mpirun inside the sbatch script, then we get the following error (a minimal sketch of such a script is included after the error message):
--------------------------------------------------------------------------
    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to lack of common network interfaces and / or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).
--------------------------------------------------------------------------
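For context, the failing case looks roughly like the sketch below. The job name, task layout and the hostname test binary are illustrative, not our exact production script; the two nodes and 1024MB match job 11057 in the logs further down:

    #!/bin/bash
    #SBATCH --job-name=ompi-test        # illustrative name, not the real one
    #SBATCH --nodes=2                   # job 11057 ran on two nodes (r1n1, r1n2)
    #SBATCH --ntasks-per-node=1         # assumed layout for this sketch
    #SBATCH --mem=1024                  # matches alloc=1024MB in the slurmd logs

    # OpenMPI's mpirun launched from inside the sbatch allocation.
    # Under SLURM 19.05.0 this is the step that fails with the ORTE daemon
    # error above; under 18.08.6 the same script works.
    mpirun hostname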

Trying to trace the problem:
- mpirun is a binary, so it cannot be traced with bash -x.
- I've run "strace mpirun hostname" to see if it helps, but I am not able to see where the problem may be (one way of capturing that trace to a file is sketched below).
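For anyone who wants to reproduce the trace, something like the following should give an equivalent capture. The -f/-o flags and the output file name are my additions here, not necessarily the exact invocation used for the linked file:

    # Follow forked children (the launched daemons) and write the trace to a file.
    strace -f -o mpirun_trace.out mpirun hostname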

Here is the output from the strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW

And here is the slurmd log with verbosity level 5:
Main node (slurmd log):
    2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
    2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
    2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
    2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
    2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
    2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
    2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
    2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
    2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
    2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
    2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
    2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
    2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
    2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
    2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
    2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
    2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
    2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
    2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
    2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job

Secondary node (slurmd log):
    2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
    2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
    2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
    2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
    2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
    2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
    2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
    2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
    2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job

Thank you very much again.

--
 Andrés Marín Díaz
Servicio de Infraestructura e Innovación
 Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
 Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
 ama...@cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
