Hi,

I do not expect spawn to work with a direct launch (e.g. srun).
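As an illustration, the sbatch script quoted below could be adapted to launch the manager through mpirun instead of a direct srun launch. This is only a sketch, reusing the openmpi/2.0.1 module and the ./manager 1 invocation from the original job script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
# Let mpirun (Open MPI's own runtime) start the manager inside the
# Slurm allocation, so MPI_Comm_spawn has a runtime that can place
# the spawned processes on the allocated nodes.
mpirun -np 1 ./manager 1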
Do you have PSM (e.g. InfiniPath) hardware? That could be linked to the failure.

Can you please try

mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1

and see if it helps?

Note that if you have the possibility, I suggest you first try that without Slurm, and then within a Slurm job.

Cheers,

Gilles

On Thursday, September 29, 2016, juraj2...@gmail.com wrote:
> Hello,
>
> I am using MPI_Comm_spawn to dynamically create new processes from a single
> manager process. Everything works fine when all the processes are running
> on the same node, but imposing the restriction to run only a single process
> per node does not work. Below are the errors produced during a multinode
> interactive session and a multinode sbatch job.
>
> The system I am using is: Linux version 3.10.0-229.el7.x86_64
> (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC))
> I am using Open MPI 2.0.1
> Slurm is version 15.08.9
>
> What is preventing my jobs from spawning on multiple nodes? Does Slurm
> require some additional configuration to allow it? Is it an issue on the MPI
> side; does it need to be compiled with some special flag (I have compiled it
> with --enable-mpi-fortran=all --with-pmi)?
>
> The code I am launching is here: https://github.com/goghino/dynamicMPI
>
> The manager tries to launch one new process (./manager 1). The error
> produced by requesting each process to be located on a different node
> (interactive session):
>
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code: (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code: (-6)
>
> The same example as above via sbatch job submission:
>
> $ cat job.sbatch
> #!/bin/bash
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
>
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
>
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [icsnode39:9692] *** and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
> srun: error: icsnode39: task 0: Exited with exit code 50
>
> Thanks for any feedback!
>
> Best regards,
> Juraj
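For completeness, a minimal manager that calls MPI_Comm_spawn might look like the sketch below. This is not the code from the dynamicMPI repository; the worker executable name (./worker) and the use of argv[1] as the number of processes to spawn are assumptions made for illustration only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Number of workers to spawn, taken from the command line
     * (hypothetical convention, mirroring "./manager 1"). */
    int nworkers = (argc > 1) ? atoi(argv[1]) : 1;
    if (nworkers < 1) nworkers = 1;

    MPI_Comm intercomm;
    int errcodes[nworkers];  /* one error code per spawned process */

    /* Ask the runtime to start nworkers copies of ./worker.  With mpirun
     * this request goes through Open MPI's runtime, which needs free slots
     * in the allocation to place the new processes.  Note that with the
     * default MPI_ERRORS_ARE_FATAL handler a failed spawn aborts the job,
     * as in the log above, rather than returning an error code. */
    int rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers,
                            MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &intercomm, errcodes);
    if (rc != MPI_SUCCESS)
        fprintf(stderr, "MPI_Comm_spawn failed\n");

    /* A real manager would communicate with the workers over intercomm
     * before finalizing. */
    MPI_Finalize();
    return 0;
}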