Hello,
I am using MPI_Comm_spawn to dynamically create new processes from a single
manager process. Everything works fine when all the processes run on the
same node, but imposing the restriction to run only a single process per
node does not work. The errors produced during a multi-node interactive
session and a multi-node sbatch job are shown further below.
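For reference, the spawn call in the manager is essentially the following (a simplified sketch, not the exact code; "./worker" is a placeholder name, the real code is in the repository linked further down):

/* manager.c: simplified sketch of the manager process (placeholder names). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int version, subversion;
    MPI_Get_version(&version, &subversion);
    printf("[manager]I'm running MPI %d.%d\n", version, subversion);

    /* Number of processes to spawn is passed on the command line: ./manager 1 */
    int nspawn = (argc > 1) ? atoi(argv[1]) : 1;

    /* Dynamically spawn nspawn new processes from this single manager process. */
    MPI_Comm intercomm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    /* ... communicate with the spawned processes over intercomm ... */

    MPI_Finalize();
    return 0;
}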
The system I am using is: Linux version 3.10.0-229.el7.x86_64 (
buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat
4.8.2-16) (GCC) )
I am using Open MPI 2.0.1
Slurm is version 15.08.9
What is preventing my jobs from spawning on multiple nodes? Does Slurm
require some additional configuration to allow it? Or is it an issue on the
MPI side: does Open MPI need to be compiled with some special flag (I have
compiled it with --enable-mpi-fortran=all --with-pmi)?
The code I am launching is here: https://github.com/goghino/dynamicMPI
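For completeness, the spawned side does little more than connect back to its parent, roughly like this (again just a sketch, not the exact code from the repository):

/* worker.c: sketch of the spawned process. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* A process started via MPI_Comm_spawn gets an intercommunicator to its parent. */
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        fprintf(stderr, "This process was not started via MPI_Comm_spawn\n");
        MPI_Finalize();
        return 1;
    }

    /* ... exchange data with the manager over 'parent' ... */

    MPI_Finalize();
    return 0;
}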
The manager tries to launch one new process (./manager 1). This is the error
produced when requesting each process to be located on a different node
(interactive session):
$ salloc -N 2
$ cat my_hosts
icsnode37
icsnode38
$ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode37
icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
[icsnode37:12614] *** Process received signal ***
[icsnode37:12614] Signal: Aborted (6)
[icsnode37:12614] Signal code: (-6)
[icsnode38:32443] *** Process received signal ***
[icsnode38:32443] Signal: Aborted (6)
[icsnode38:32443] Signal code: (-6)
The same example as above, submitted as an sbatch job:
$ cat job.sbatch
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
module load openmpi/2.0.1
srun -n 1 -N 1 ./manager 1
$ cat output.o
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode39
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[icsnode39:9692] *** An error occurred in MPI_Comm_spawn
[icsnode39:9692] *** reported by process [1007812608,0]
[icsnode39:9692] *** on communicator MPI_COMM_SELF
[icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
[icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[icsnode39:9692] ***    and potentially your MPI job)
In: PMI_Abort(50, N/A)
slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
srun: error: icsnode39: task 0: Exited with exit code 50
Thanks for any feedback!
Best regards,
Juraj