> On Sep 20, 2015, at 2:30 PM, Lev Givon <l...@columbia.edu> wrote:
> 
> Received from Ralph Castain on Sun, Sep 20, 2015 at 05:08:10PM EDT:
>>> On Sep 20, 2015, at 12:57 PM, Lev Givon <l...@columbia.edu> wrote:
>>> 
>>> While debugging a problem that causes OpenMPI to emit a non-fatal error
>>> message to stderr, I noticed that the error message is followed by a line
>>> similar to the following (I have help message aggregation turned on):
>>> 
>>> [myhost:10008] 17 more processes have sent help message some_file.txt / blah blah failed
>>> 
>>> The job that I am running is started as a single process (via SLURM using PMI)
>>> that spawns 2 processes via MPI_Comm_spawn; the number of processes reported
>>> in the above line, however, is much larger than 2. Why would the number of
>>> processes reporting an error be so big? When I examine the MPI processes in
>>> real time as they run (e.g., via top), there never appear to be that many
>>> processes running.
>>> 
>>> I'm using OpenMPI 1.10.0 built on Ubuntu 14.04.3; as indicated by ompi_info,
>>> I don't have multiple MPI threads enabled:
>>> 
>>> posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
> 
>> Just to be clear: you are starting the single process using “srun -n 1 ./app”,
>> and the app calls MPI_Comm_spawn?
> 
> Yes.
> 
>> I’m not sure that’s really supported… I think there might be something in Slurm
>> behind that call, but I have no idea if it really works.
> 
> Well, the same question applies if I don't use SLURM and launch with
> mpiexec -np 1.
> 
> On a closer look, it seems that the "17" corresponds to the number of times
> the error was emitted after its first occurrence, regardless of how many actual
> MPI processes were running (each of the MPI processes spawned by my program
> iterates a certain number of times and causes the error to occur during each
> iteration).

That is correct: the count reflects how many times the message was emitted,
not how many processes are running. If you tell us the error, we’d be happy to
help diagnose.
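
To illustrate the counting behavior described above, here is a toy Python model of help message aggregation. It is only a sketch and not Open MPI's actual implementation: the HelpAggregator class and its methods are hypothetical names, and the figure of 9 iterations per spawned process is an assumption, chosen because 2 processes times 9 iterations gives 18 emissions (one shown in full plus "17 more").

```python
from collections import Counter


class HelpAggregator:
    """Toy model of help message aggregation (hypothetical, not Open MPI code)."""

    def __init__(self):
        # Number of times each (file, topic) pair has been reported.
        self.counts = Counter()

    def report(self, filename, topic):
        # Every emission is counted, regardless of which process sent it.
        self.counts[(filename, topic)] += 1

    def summary(self):
        # The first emission is assumed to be printed in full, so only
        # the remaining ones appear in the aggregated count.
        lines = []
        for (filename, topic), n in self.counts.items():
            if n > 1:
                lines.append(
                    f"{n - 1} more processes have sent help message "
                    f"{filename} / {topic}"
                )
        return lines


agg = HelpAggregator()
for rank in range(2):            # the 2 spawned processes
    for iteration in range(9):   # assumed iteration count per process
        agg.report("some_file.txt", "blah blah failed")

summary = agg.summary()
print("\n".join(summary))
```

In this model the summary line says "processes" but is really counting emissions, which is why the number can exceed the number of MPI processes that ever ran.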


> -- 
> Lev Givon
> Bionet Group | Neurokernel Project
> http://www.columbia.edu/~lev/
> http://lebedov.github.io/
> http://neurokernel.github.io/
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27637.php
