Hi,

I do not expect MPI_Comm_spawn to work with a direct launch (e.g. srun).

Do you have PSM (e.g. InfiniPath) hardware? That could be linked to the
failure (the ptl.c assertion in your output looks like it comes from the PSM
library).

Can you please try

mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1

and see if it helps? This forces the ob1 PML and the tcp/sm/self BTLs, so the
PSM library is not involved.

Note that if you have the possibility, I suggest you first try that outside of
Slurm, and then from within a Slurm job (see the sketch below).
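
For the Slurm case, here is a minimal sketch of a batch script in the style of
your job.sbatch, launching through mpirun instead of srun (the module name and
node counts are taken from your example; adjust for your site):

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
# mpirun reads the node list from the Slurm allocation, so no hostfile is needed
mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 ./manager 1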

Cheers,

Gilles

On Thursday, September 29, 2016, juraj2...@gmail.com <juraj2...@gmail.com>
wrote:

> Hello,
>
> I am using MPI_Comm_spawn to dynamically create new processes from a single
> manager process. Everything works fine when all the processes are running
> on the same node. But imposing the restriction to run only a single process
> per node does not work. Below are the errors produced during a multinode
> interactive session and a multinode sbatch job.
>
> The system I am using is: Linux version 3.10.0-229.el7.x86_64
> (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC))
> I am using Open MPI 2.0.1
> Slurm is version 15.08.9
>
> What is preventing my jobs from spawning on multiple nodes? Does Slurm require
> some additional configuration to allow it? Is it an issue on the MPI side:
> does it need to be compiled with some special flag (I have compiled it with
> --enable-mpi-fortran=all --with-pmi)?
>
> The code I am launching is here: https://github.com/goghino/dynamicMPI
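>
> In essence the manager's spawn call is roughly the following (a minimal
> sketch rather than the exact repo code; the "./worker" binary name is a
> placeholder):
>
> #include <mpi.h>
>
> /* Minimal sketch: the manager spawns worker processes and gets back an
>    intercommunicator. "./worker" is a placeholder binary name. */
> int main(int argc, char *argv[])
> {
>     MPI_Comm intercomm;
>     int nworkers = 1;  /* the real program takes this from argv[1] */
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
>                    0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>     /* ... exchange data with the workers over intercomm ... */
>     MPI_Finalize();
>     return 0;
> }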
>
> The manager tries to launch one new process (./manager 1); the error produced
> when requesting each process to be located on a different node (interactive
> session) is:
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code:  (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code:  (-6)
>
> The same example as above via sbatch job submission:
> $ cat job.sbatch
> #!/bin/bash
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
>
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
>
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [icsnode39:9692] ***    and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20
> ***
> srun: error: icsnode39: task 0: Exited with exit code 50
>
> Thanks for any feedback!
>
> Best regards,
> Juraj
>