No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the problem.
> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Hello,
>
> I am using OpenMPI 1.10.2 compiled with Intel. I am trying to get the spawn functionality to work inside a for loop, but I keep getting the error "too many retries sending message to <addr>, giving up" somewhere down the line in the for loop, seemingly because the processors are not being fully freed when disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c example/test (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c), and it has the exact same problem. I also found this mailing list post from about a month and a half ago: https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
>
> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue I am having (i.e. the loop_spawn example not working)? If so, do you know if we can downgrade to e.g. 1.10.1 or another version? Or is there another way to work around this bug until you get a new release out (or is one coming shortly to fix this, maybe)?
>
> Below is the output of the loop_spawn test on our university's cluster, which I know very little about in terms of architecture, but I can get information if that would be helpful. The large group of people who manage this cluster are very good.
>
> Thanks for your time.
>
> Jason
>
> mpiexec -np 5 loop_spawn
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> Child: launch
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #0 rank 4, size 6
> parent: MPI_Comm_spawn #0 rank 0, size 6
> parent: MPI_Comm_spawn #0 rank 2, size 6
> parent: MPI_Comm_spawn #0 rank 3, size 6
> parent: MPI_Comm_spawn #0 rank 1, size 6
> Child 329941: exiting
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> Child: launch
> parent: MPI_Comm_spawn #1 rank 0, size 6
> parent: MPI_Comm_spawn #1 rank 2, size 6
> parent: MPI_Comm_spawn #1 rank 1, size 6
> parent: MPI_Comm_spawn #1 rank 3, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #1 rank 4, size 6
> Child 329945: exiting
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> Child: launch
> parent: MPI_Comm_spawn #2 rank 3, size 6
> parent: MPI_Comm_spawn #2 rank 0, size 6
> parent: MPI_Comm_spawn #2 rank 2, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #2 rank 1, size 6
> parent: MPI_Comm_spawn #2 rank 4, size 6
> Child 329949: exiting
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> Child: launch
> [node:port?]
> too many retries sending message to <addr>, giving up
> -------------------------------------------------------
> Child job 5 terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[...],0]
>   Exit code:    255
> --------------------------------------------------------------------------
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29425.php
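For reference, the spawn-in-a-loop pattern that the loop_spawn test exercises looks roughly like the sketch below. This is not the actual orte/test/mpi/loop_spawn.c source (the real test spawns a separate child executable and loops many more times); to keep the example self-contained the program spawns itself, and the iteration count and printed messages are illustrative.

/* Minimal sketch of MPI_Comm_spawn inside a for loop, modeled loosely on
 * the loop_spawn test discussed above.  The loop count and output are
 * illustrative; the real test uses a separate child binary. */
#include <mpi.h>
#include <stdio.h>

#define ITERATIONS 10   /* illustrative; the real test loops far more times */

int main(int argc, char *argv[])
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: every iteration spawns one child, merges it into an
         * intracommunicator, then frees/disconnects the communicators so the
         * child can finish before the next spawn. */
        for (int i = 0; i < ITERATIONS; i++) {
            MPI_Comm intercomm, merged;
            int rank, size;

            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Intercomm_merge(intercomm, 0 /* parent ranks first */, &merged);
            MPI_Comm_rank(merged, &rank);
            MPI_Comm_size(merged, &size);
            printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n", i, rank, size);

            MPI_Comm_free(&merged);
            MPI_Comm_disconnect(&intercomm);
        }
    } else {
        /* Child side: merge with the parent, report, disconnect, and exit. */
        MPI_Comm merged;
        int rank, size;

        MPI_Intercomm_merge(parent, 1 /* child ranks last */, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("child: merged rank = %d, size = %d\n", rank, size);

        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched as in the report (mpiexec -np 5 ...), each iteration should spawn one child, merge it into a size-6 intracommunicator, and disconnect; the behavior reported above on OpenMPI 1.10.2 is that, after some number of iterations, the run aborts with "too many retries sending message to <addr>, giving up".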