Hello, I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying to get the spawn functionality working inside a for loop, but somewhere down the line in the loop I keep getting the error "too many retries sending message to <addr>, giving up", seemingly because the spawned processes are not being fully freed when they disconnect/finish. I found the orte/test/mpi/loop_spawn.c <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c> example/test, and it exhibits exactly the same problem. I also found this <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing list post from about a month and a half ago.
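For reference, the parent side of my loop follows essentially the same pattern as loop_spawn.c. Here is a stripped-down sketch of what I am doing; the child executable name "child_exe" and the iteration count are placeholders, not my real values:

/* Sketch of the spawn-in-a-loop pattern (parent side), following the
 * same idea as orte/test/mpi/loop_spawn.c.  "child_exe" and ITERATIONS
 * are placeholders. */
#include <mpi.h>
#include <stdio.h>

#define ITERATIONS 100   /* placeholder iteration count */

int main(int argc, char **argv)
{
    MPI_Comm intercomm, merged;
    int i, rank, size;

    MPI_Init(&argc, &argv);

    for (i = 0; i < ITERATIONS; i++) {
        /* Launch one child process per iteration. */
        MPI_Comm_spawn("child_exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        /* Merge parent and child into a single intracommunicator. */
        MPI_Intercomm_merge(intercomm, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: iteration %d, rank %d of %d\n", i, rank, size);

        /* Release the communicators so the child can exit; this is
         * where I expected everything to be freed before the next
         * iteration. */
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}

With this pattern the first few iterations succeed and then the send failure appears, exactly as in the loop_spawn output shown below.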
Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue I am having (i.e., the loop_spawn example not working)? If so, do you know whether we can work around it by downgrading to e.g. 1.10.1 or another version? Or is there another fix for this bug until a new release comes out (or is one coming shortly that addresses it)?

Below is the output of the loop_spawn test on our university's cluster. I know very little about the cluster's architecture, but the team that manages it is very good, so I can get more information if that would be helpful.

Thanks for your time,
Jason

mpiexec -np 5 loop_spawn
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
Child: launch
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #0 rank 4, size 6
parent: MPI_Comm_spawn #0 rank 0, size 6
parent: MPI_Comm_spawn #0 rank 2, size 6
parent: MPI_Comm_spawn #0 rank 3, size 6
parent: MPI_Comm_spawn #0 rank 1, size 6
Child 329941: exiting
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
Child: launch
parent: MPI_Comm_spawn #1 rank 0, size 6
parent: MPI_Comm_spawn #1 rank 2, size 6
parent: MPI_Comm_spawn #1 rank 1, size 6
parent: MPI_Comm_spawn #1 rank 3, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #1 rank 4, size 6
Child 329945: exiting
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
Child: launch
parent: MPI_Comm_spawn #2 rank 3, size 6
parent: MPI_Comm_spawn #2 rank 0, size 6
parent: MPI_Comm_spawn #2 rank 2, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #2 rank 1, size 6
parent: MPI_Comm_spawn #2 rank 4, size 6
Child 329949: exiting
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
Child: launch
[node:port?] too many retries sending message to <addr>, giving up
-------------------------------------------------------
Child job 5 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[...],0]
  Exit code:    255
--------------------------------------------------------------------------