While debugging a problem that causes OpenMPI to emit a non-fatal error message
on stderr, I noticed that the error message is followed by a line similar to
the following (I have help message aggregation turned on):

[myhost:10008] 17 more processes have sent help message some_file.txt / blah blah failed
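
For reference, as far as I know aggregation is controlled by the
orte_base_help_aggregate MCA parameter. Below is a sketch of how it could be
switched off so that every individual message is printed. It assumes the job is
launched directly by srun, so the parameter has to be passed through the
environment; the program name is hypothetical:

```shell
# Assumption: Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
# Setting this to 0 disables help message aggregation.
export OMPI_MCA_orte_base_help_aggregate=0

# Hypothetical launch line, commented out; substitute the real program:
# srun -n 1 ./my_parent_program
```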

The job that I am running is started as a single process (via SLURM using PMI)
that spawns 2 processes via MPI_Comm_spawn; the number of processes reported in the
above line, however, is much larger than 2. Why would the number of processes
reporting an error be so big? When I examine the MPI processes in real time as
they run (e.g., via top), there never appear to be that many processes running.

I'm using OpenMPI 1.10.0 built on Ubuntu 14.04.3; as indicated by ompi_info, I
don't have multiple MPI threads enabled:

posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
