Hello,

I am using MPI_Comm_spawn to dynamically create new processes from a single manager process. Everything works fine when all the processes are running on the same node, but when I impose the restriction of running only a single process per node, it does not work. Below are the errors produced during a multi-node interactive session and a multi-node sbatch job.
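For context, the spawn pattern in my code boils down to the following minimal sketch (illustrative only; the actual sources are in the repository linked below, and the worker binary name "./worker" is just a placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Determine whether this process was spawned or started directly. */
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Manager: spawn one worker over MPI_COMM_SELF. */
        MPI_Comm intercomm;
        int errcode;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, &errcode);
        printf("[manager] spawned 1 worker\n");
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Worker: spawned by the manager, connected through 'parent'. */
        printf("[worker] started\n");
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}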
The system I am using is: Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC)). I am using Open MPI 2.0.1; Slurm is version 15.08.9.

What is preventing my jobs from spawning on multiple nodes? Does Slurm require some additional configuration to allow it? Or is it an issue on the MPI side: does it need to be compiled with some special flag (I compiled it with --enable-mpi-fortran=all --with-pmi)?

The code I am launching is here: https://github.com/goghino/dynamicMPI

The manager tries to launch one new process (./manager 1). This is the error produced when requesting each process to be located on a different node (interactive session):

$ salloc -N 2
$ cat my_hosts
icsnode37
icsnode38
$ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode37
icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
[icsnode37:12614] *** Process received signal ***
[icsnode37:12614] Signal: Aborted (6)
[icsnode37:12614] Signal code: (-6)
[icsnode38:32443] *** Process received signal ***
[icsnode38:32443] Signal: Aborted (6)
[icsnode38:32443] Signal code: (-6)

The same example as above via sbatch job submission:

$ cat job.sbatch
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
srun -n 1 -N 1 ./manager 1

$ cat output.o
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode39
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[icsnode39:9692] *** An error occurred in MPI_Comm_spawn
[icsnode39:9692] *** reported by process [1007812608,0]
[icsnode39:9692] *** on communicator MPI_COMM_SELF
[icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
[icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[icsnode39:9692] *** and potentially your MPI job)
In: PMI_Abort(50, N/A)
slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
srun: error: icsnode39: task 0: Exited with exit code 50

Thanks for any feedback!

Best regards,
Juraj