Spawn definitely does not work with srun. I don’t recognize the name of the 
file that segfaulted - what is “ptl.c”? Is that in your manager program?


> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Hi,
> 
> I do not expect spawn to work with a direct launch (e.g. srun).
> 
> Do you have PSM (e.g. Infinipath) hardware? That could be linked to the failure.
> 
> Can you please try
> 
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1
> 
> and see if it helps?
> 
> Note: if you have the possibility, I suggest you first try that without Slurm, 
> and then within a Slurm job.
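> 
> For example, within the Slurm job a batch script along these lines (the SBATCH 
> settings and module name follow your original script, the mca settings are the 
> ones above; adjust paths as needed) would let mpirun, rather than srun, start 
> the manager:
> 
> #!/bin/bash
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
> 
> module load openmpi/2.0.1
> # start only the manager with mpirun; MPI_Comm_spawn adds the worker later
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 ./manager 1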
> 
> Cheers,
> 
> Gilles
> 
> On Thursday, September 29, 2016, juraj2...@gmail.com wrote:
> Hello,
> 
> I am using MPI_Comm_spawn to dynamically create new processes from a single 
> manager process. Everything works fine when all the processes are running on 
> the same node, but imposing the restriction to run only a single process per 
> node does not work. Below are the errors produced during a multinode 
> interactive session and a multinode sbatch job.
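> 
> For reference, the spawn itself is essentially a single MPI_Comm_spawn call. A 
> minimal sketch along the lines of what the manager does (the worker binary name 
> "./worker" and the argument handling here are illustrative, not copied from the 
> repository) is:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     MPI_Init(&argc, &argv);
> 
>     /* Number of workers to spawn, e.g. ./manager 1 */
>     int nworkers = (argc > 1) ? atoi(argv[1]) : 1;
> 
>     MPI_Comm intercomm;
> 
>     /* Spawn the workers; with MPI_COMM_SELF the manager alone is the parent. */
>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
>                    0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
> 
>     printf("[manager] spawned %d worker process(es)\n", nworkers);
> 
>     MPI_Comm_disconnect(&intercomm);
>     MPI_Finalize();
>     return 0;
> }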
> 
> The system I am using is: Linux version 3.10.0-229.el7.x86_64 
> (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 
> (Red Hat 4.8.2-16) (GCC))
> I am using Open MPI 2.0.1
> Slurm is version 15.08.9
> 
> What is preventing my jobs from spawning on multiple nodes? Does Slurm require 
> some additional configuration to allow it? Or is it an issue on the MPI side: 
> does Open MPI need to be compiled with some special flag (I have compiled it 
> with --enable-mpi-fortran=all --with-pmi)?
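> 
> For reference, the Open MPI build was essentially configured like this (the 
> install prefix is just a placeholder):
> 
> ./configure --prefix=$HOME/openmpi-2.0.1 --enable-mpi-fortran=all --with-pmi
> make && make install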
> 
> The code I am launching is here: https://github.com/goghino/dynamicMPI
> 
> The manager tries to launch one new process (./manager 1). The error below is 
> produced when each process is required to be located on a different node 
> (interactive session):
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code:  (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code:  (-6)
> 
> The same example as above via sbatch job submission:
> $ cat job.sbatch
> #!/bin/bash
> 
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
> 
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
> 
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
> will now abort,
> [icsnode39:9692] ***    and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
> srun: error: icsnode39: task 0: Exited with exit code 50
> 
> Thanks for any feedback!
> 
> Best regards,
> Juraj

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
