Received from Ralph Castain on Sun, Sep 20, 2015 at 05:08:10PM EDT:
> > On Sep 20, 2015, at 12:57 PM, Lev Givon <l...@columbia.edu> wrote:
> > 
> > While debugging a problem that causes a non-fatal OpenMPI error message to
> > be written to stderr, I noticed that the error message is followed by a
> > line similar to the following (I have help message aggregation turned on):
> > 
> > [myhost:10008] 17 more processes have sent help message some_file.txt / blah blah failed
> > 
> > The job that I am running is started as a single process (via SLURM using
> > PMI) that spawns 2 processes via MPI_Comm_spawn; the number of processes
> > reported in the above line, however, is much larger than 2. Why would the
> > number of processes reporting an error be so big? When I examine the MPI
> > processes in real time as they run (e.g., via top), there never appear to
> > be that many processes running.
> > 
> > I'm using OpenMPI 1.10.0 built on Ubuntu 14.04.3; as indicated by
> > ompi_info, I don't have multiple MPI threads enabled:
> > 
> > posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)

> Just to be clear: you are starting the single process using “srun -n 1 ./app”,
> and the app calls MPI_Comm_spawn?

Yes.

> I’m not sure that’s really supported…I think there might be something in Slurm
> behind that call, but I have no idea if it really works.

Well, the same question applies if I don't use SLURM and launch with
mpiexec -np 1.

On closer inspection, the "17" seems to correspond to the number of times the
error was emitted after its first occurrence, regardless of how many MPI
processes were actually running (each of the MPI processes spawned by my
program iterates a certain number of times and triggers the error during each
iteration).
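
A toy model of this behavior may make the counting clearer. It is only a
sketch, not Open MPI's actual implementation: the class HelpAggregator, the
function simulate, and the choice of 9 iterations are all hypothetical, picked
so that 2 processes produce a count of 17. The assumption being illustrated is
that the aggregator counts repeated messages with the same file/topic pair
rather than distinct senders.

```python
from collections import defaultdict


class HelpAggregator:
    """Toy aggregator keyed by (filename, topic); hypothetical, not Open MPI code."""

    def __init__(self):
        self.displayed = set()              # keys already printed in full
        self.suppressed = defaultdict(int)  # repeats seen after the first display
        self.senders = defaultdict(set)     # distinct senders per key

    def report(self, sender, filename, topic):
        """Record one help message; return True if it would be printed in full."""
        key = (filename, topic)
        self.senders[key].add(sender)
        if key not in self.displayed:
            self.displayed.add(key)
            return True
        # Assumption: every repeat is counted, even from a sender seen before.
        self.suppressed[key] += 1
        return False

    def summary(self, filename, topic):
        """Format the aggregated line in the style of the message quoted above."""
        count = self.suppressed[(filename, topic)]
        return "%d more processes have sent help message %s / %s" % (
            count, filename, topic)


def simulate(num_procs, iterations):
    """Each process triggers the same error once per iteration."""
    agg = HelpAggregator()
    for _ in range(iterations):
        for rank in range(num_procs):
            agg.report(rank, "some_file.txt", "blah blah failed")
    return agg


if __name__ == "__main__":
    agg = simulate(num_procs=2, iterations=9)
    print(agg.summary("some_file.txt", "blah blah failed"))
    print("distinct senders:", len(agg.senders[("some_file.txt", "blah blah failed")]))
```

With 2 processes and 9 iterations there are 18 messages in total: the first is
shown in full and the remaining 17 are folded into the summary line, even
though only 2 distinct processes ever sent anything.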
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
