Ah, that may be why it wouldn’t show up in the OMPI code base itself. If that is the case here, then no, OMPI v2.0.1 does not support comm_spawn over PSM. It is fixed in the upcoming v2.0.2 release.
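Until v2.0.2 is out, the workaround Gilles suggests below - forcing the ob1 PML so the PSM MTL is never selected - should be worth trying, assuming plain TCP between the nodes is acceptable for this test:

$ mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1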
> On Sep 29, 2016, at 6:58 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> My guess is that ptl.c comes from the PSM lib ...
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Spawn definitely does not work with srun. I don’t recognize the name of the file that segfaulted - what is “ptl.c”? Is that in your manager program?
>
>> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> Hi,
>>
>> I do not expect spawn to work with direct launch (e.g., srun).
>>
>> Do you have PSM (e.g., InfiniPath) hardware? That could be linked to the failure.
>>
>> Can you please try
>>
>> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1
>>
>> and see if it helps?
>>
>> Note: if you have the possibility, I suggest you first try that without Slurm, and then within a Slurm job.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thursday, September 29, 2016, juraj2...@gmail.com <juraj2...@gmail.com> wrote:
>> Hello,
>>
>> I am using MPI_Comm_spawn to dynamically create new processes from a single manager process. Everything works fine when all the processes are running on the same node, but imposing the restriction to run only a single process per node does not work. Below are the errors produced during a multinode interactive session and a multinode sbatch job.
>>
>> The system I am using is: Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC))
>> I am using Open MPI 2.0.1
>> Slurm is version 15.08.9
>>
>> What is preventing my jobs from spawning on multiple nodes? Does Slurm require some additional configuration to allow it? Or is it an issue on the MPI side - does it need to be compiled with some special flag (I have compiled it with --enable-mpi-fortran=all --with-pmi)?
>>
>> The code I am launching is here: https://github.com/goghino/dynamicMPI
>>
>> The manager tries to launch one new process (./manager 1). The error produced when requesting each process to be located on a different node (interactive session):
>> $ salloc -N 2
>> $ cat my_hosts
>> icsnode37
>> icsnode38
>> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode37
>> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> [icsnode37:12614] *** Process received signal ***
>> [icsnode37:12614] Signal: Aborted (6)
>> [icsnode37:12614] Signal code: (-6)
>> [icsnode38:32443] *** Process received signal ***
>> [icsnode38:32443] Signal: Aborted (6)
>> [icsnode38:32443] Signal code: (-6)
>>
>> The same example as above via sbatch job submission:
>> $ cat job.sbatch
>> #!/bin/bash
>>
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=1
>>
>> module load openmpi/2.0.1
>> srun -n 1 -N 1 ./manager 1
>>
>> $ cat output.o
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode39
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
>> [icsnode39:9692] *** reported by process [1007812608,0]
>> [icsnode39:9692] *** on communicator MPI_COMM_SELF
>> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
>> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [icsnode39:9692] *** and potentially your MPI job)
>> In: PMI_Abort(50, N/A)
>> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
>> srun: error: icsnode39: task 0: Exited with exit code 50
>>
>> Thanks for any feedback!
>>
>> Best regards,
>> Juraj
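For anyone following along, the spawn pattern Juraj describes above boils down to roughly the minimal sketch below. This is illustrative only - the actual code is at https://github.com/goghino/dynamicMPI, and the "./worker" executable name is a placeholder:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;
    int errcode;

    /* Spawn one child process; in the runs reported above this is the call
       that dies with the ptl.c assertion or MPI_ERR_SPAWN. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, &errcode);

    printf("[manager] spawn returned, child error code = %d\n", errcode);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Nothing in the spawn call itself is node-aware; whether it succeeds comes down to how the job is launched (mpirun vs. a direct srun launch) and which PML/MTL the runtime selects, which matches the behaviour reported above.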