I've been using OpenMPI 1.8.4, manually built on Ubuntu 14.04.2 against the PMI libraries provided by the stock SLURM 2.6.5 Ubuntu packages. When I launch jobs via mpiexec directly, I can successfully run MPI programs that use MPI_Comm_spawn via mpi4py 1.3.1 (also manually built against OpenMPI 1.8.4) to create processes dynamically. I can't get SLURM to start such jobs, however, even though SLURM successfully starts jobs with a fixed number of processes. For example, attempting to run a job that spawns more than one process with

    srun -n 1 python myprogram.py

results in the following error:

    [huxley:24037] [[5176,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
    [huxley:24037] *** An error occurred in MPI_Comm_spawn
    [huxley:24037] *** reported by process [339214337,0]
    [huxley:24037] *** on communicator MPI_COMM_SELF
    [huxley:24037] *** MPI_ERR_UNKNOWN: unknown error
    [huxley:24037] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [huxley:24037] ***    and potentially your MPI job)

Running the same program with

    mpiexec -np 1 python myprogram.py

works properly.

Has anyone successfully used SLURM (possibly a more recent version than 2.6.5) to submit spawning OpenMPI jobs? If so, what might be causing the above error?

--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/