No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the 
problem.

> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
> 
> Hello,
> 
> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying to 
> get the spawn functionality to work inside a for loop, but I keep getting the 
> error "too many retries sending message to <addr>, giving up" partway through 
> the loop, seemingly because the processors are not being fully freed when 
> disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c 
> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c> 
> example/test, and it has the exact same problem. I also found this 
> <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing 
> list post from ~ a month and a half ago.
> 
> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue 
> I am having (i.e., the loop_spawn example not working)? If so, do you know if 
> we can downgrade to, e.g., 1.10.1 or another version? Or is there another way 
> to work around this bug until a new release is out (or is a fix coming 
> shortly)?
> 
> Below is the output of the loop_spawn test on our university's cluster, which 
> I know very little about in terms of architecture but can get information if 
> it's helpful. The large group of people who manage this cluster are very good.
> 
> Thanks for your time.
> 
> Jason
> 
> mpiexec -np 5 loop_spawn
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> Child: launch
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #0 rank 4, size 6
> parent: MPI_Comm_spawn #0 rank 0, size 6
> parent: MPI_Comm_spawn #0 rank 2, size 6
> parent: MPI_Comm_spawn #0 rank 3, size 6
> parent: MPI_Comm_spawn #0 rank 1, size 6
> Child 329941: exiting
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> Child: launch
> parent: MPI_Comm_spawn #1 rank 0, size 6
> parent: MPI_Comm_spawn #1 rank 2, size 6
> parent: MPI_Comm_spawn #1 rank 1, size 6
> parent: MPI_Comm_spawn #1 rank 3, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #1 rank 4, size 6
> Child 329945: exiting
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> Child: launch
> parent: MPI_Comm_spawn #2 rank 3, size 6
> parent: MPI_Comm_spawn #2 rank 0, size 6
> parent: MPI_Comm_spawn #2 rank 2, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #2 rank 1, size 6
> parent: MPI_Comm_spawn #2 rank 4, size 6
> Child 329949: exiting
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> Child: launch
> [node:port?] too many retries sending message to <addr>, giving up
> -------------------------------------------------------
> Child job 5 terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[...],0]
>   Exit code:    255
> --------------------------------------------------------------------------
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29425.php
