I've been using OpenMPI 1.8.4, manually built on Ubuntu 14.04.2 against the PMI
libraries provided by the stock SLURM 2.6.5 Ubuntu packages. When I launch jobs
via mpiexec directly, I can successfully run MPI jobs that use MPI_Comm_spawn
via mpi4py 1.3.1 (also manually built against OpenMPI 1.8.4) to dynamically
create processes. I can't seem to get SLURM to start such jobs, however, even
though SLURM does successfully start jobs with a fixed number of processes. For
example, attempting to run a job that spawns more than one process with

srun -n 1 python myprogram.py

results in the following error:

[huxley:24037] [[5176,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
[huxley:24037] *** An error occurred in MPI_Comm_spawn
[huxley:24037] *** reported by process [339214337,0]
[huxley:24037] *** on communicator MPI_COMM_SELF
[huxley:24037] *** MPI_ERR_UNKNOWN: unknown error
[huxley:24037] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[huxley:24037] ***    and potentially your MPI job)

Running the same program with

mpiexec -np 1 python myprogram.py

works properly.
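
A minimal sketch of the kind of spawning program described above might look
like the following. It is not the actual myprogram.py: the file name
spawn_sketch.py, the child count of 2, the helper spawn_args, and the explicit
"parent"/"child" role argument are all assumptions made for illustration. It
spawns from MPI_COMM_SELF, the communicator named in the error output.

```python
# spawn_sketch.py: hypothetical stand-in for myprogram.py (not the original program)
import os
import sys

N_CHILDREN = 2  # assumption: the post only says "more than one" process


def spawn_args(script, role="child"):
    """Build the argument list handed to each spawned Python interpreter."""
    return [script, role]


def parent():
    # Imported lazily so that MPI is only initialized when a role is requested.
    from mpi4py import MPI

    script = os.path.abspath(__file__)
    # MPI_Comm_spawn on MPI_COMM_SELF; returns an intercommunicator to the children.
    inter = MPI.COMM_SELF.Spawn(sys.executable,
                                args=spawn_args(script),
                                maxprocs=N_CHILDREN)
    print("parent: spawned %d children" % inter.Get_remote_size())
    inter.Disconnect()


def child():
    from mpi4py import MPI

    # Intercommunicator back to the process that called Spawn.
    inter = MPI.Comm.Get_parent()
    print("child %d of %d started" % (MPI.COMM_WORLD.Get_rank(),
                                      MPI.COMM_WORLD.Get_size()))
    inter.Disconnect()


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else None
    if role == "parent":
        parent()
    elif role == "child":
        child()
    else:
        print("usage: python spawn_sketch.py parent")
```

Under these assumptions it would be launched as
"srun -n 1 python spawn_sketch.py parent" or
"mpiexec -np 1 python spawn_sketch.py parent", mirroring the two commands above.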

Has anyone successfully used SLURM (possibly a more recent version than 2.6.5)
to launch OpenMPI jobs that call MPI_Comm_spawn? If so, what might be causing
the above error?
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
