I dug into this a bit (with some help from others) and found that the spawn code appears to be working correctly - it is the test in orte/test that is wrong. The test has been correctly updated in the 2.x and master repos, but we failed to backport it to the 1.10 series. I have done so this morning, and it will be in the upcoming 1.10.3 release (out very soon).
> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the problem.
>
>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>
>> Hello,
>>
>> I am using OpenMPI 1.10.2 compiled with Intel. I am trying to get the spawn
>> functionality to work inside a for loop, but I continue to get the error "too
>> many retries sending message to <addr>, giving up" somewhere down the line
>> in the for loop, seemingly because the processors are not being fully freed
>> when disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c
>> example/test (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c),
>> and it has the exact same problem. I also found this mailing list post
>> (https://www.open-mpi.org/community/lists/devel/2016/04/18814.php) from
>> about a month and a half ago.
>>
>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue
>> I am having (i.e., the loop_spawn example not working)? If so, do you know if
>> we can downgrade to e.g. 1.10.1 or another version? Or is there another
>> solution to fix this bug until you get a new release out (or is one coming
>> shortly to fix this, maybe)?
>>
>> Below is the output of the loop_spawn test on our university's cluster. I
>> know very little about its architecture, but I can get information if that
>> would be helpful; the large group of people who manage this cluster are very
>> good.
>>
>> Thanks for your time.
>>
>> Jason
>>
>> mpiexec -np 5 loop_spawn
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> Child: launch
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #0 rank 4, size 6
>> parent: MPI_Comm_spawn #0 rank 0, size 6
>> parent: MPI_Comm_spawn #0 rank 2, size 6
>> parent: MPI_Comm_spawn #0 rank 3, size 6
>> parent: MPI_Comm_spawn #0 rank 1, size 6
>> Child 329941: exiting
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #1 rank 0, size 6
>> parent: MPI_Comm_spawn #1 rank 2, size 6
>> parent: MPI_Comm_spawn #1 rank 1, size 6
>> parent: MPI_Comm_spawn #1 rank 3, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #1 rank 4, size 6
>> Child 329945: exiting
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #2 rank 3, size 6
>> parent: MPI_Comm_spawn #2 rank 0, size 6
>> parent: MPI_Comm_spawn #2 rank 2, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #2 rank 1, size 6
>> parent: MPI_Comm_spawn #2 rank 4, size 6
>> Child 329949: exiting
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> Child: launch
>> [node:port?] too many retries sending message to <addr>, giving up
>> -------------------------------------------------------
>> Child job 5 terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[...],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29425.php
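For anyone following along, the spawn-in-a-loop pattern under discussion looks roughly like the sketch below. This is modeled on orte/test/mpi/loop_spawn.c but is not the exact test source; the "./child" executable name and the iteration count are placeholders, and the child is assumed to be a small MPI program that initializes, merges with its parent, and exits. Build with mpicc and launch with mpiexec against a working MPI installation.

```c
/* Sketch of the MPI_Comm_spawn-in-a-loop pattern (parent side),
 * modeled on orte/test/mpi/loop_spawn.c; not the exact test source.
 * "./child" is a placeholder for a child MPI program. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank, size;
    MPI_Comm child, merged;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 1000; i++) {
        /* All parent ranks collectively spawn one child process. */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        /* Merge the intercommunicator so parent and child share one
         * intracommunicator. */
        MPI_Intercomm_merge(child, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
               i, rank, size);
        /* Free and disconnect so the runtime can reclaim the child's
         * resources; the failure reported above appeared after enough
         * iterations of this step on 1.10.x. */
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&child);
    }
    MPI_Finalize();
    return 0;
}
```

The key point for the bug report is the free/disconnect at the bottom of the loop: each iteration is supposed to fully release the previous child job before the next spawn, which is why resources leaking across iterations eventually exhausts retries.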