Hello,

I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying to
get the spawn functionality (MPI_Comm_spawn) to work inside a for loop, but
a few iterations in I consistently get the error "too many retries sending
message to <addr>, giving up", seemingly because resources are not being
fully released when the spawned processes disconnect/finish. I found the
orte/test/mpi/loop_spawn.c example/test
(https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c),
and it exhibits exactly the same problem. I also found this mailing list
post from about a month and a half ago:
https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
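
For reference, the pattern I am using is essentially the same as the
loop_spawn.c test: spawn a child, merge the intercommunicator, then tear
everything down before the next iteration. Here is a minimal sketch; the
child executable name and the iteration count are placeholders, not my
actual code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm inter, merged;
    int rank, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 100; i++) {
        /* All parent ranks collectively spawn one child process. */
        err = MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                             0, MPI_COMM_WORLD, &inter,
                             MPI_ERRCODES_IGNORE);
        if (rank == 0)
            printf("spawn #%d returned %d\n", i, err);

        /* Merge parent and child into one intracommunicator, then free
         * both communicators before the next iteration. */
        MPI_Intercomm_merge(inter, 0, &merged);
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&inter);
        /* A few iterations in, this is where the "too many retries
         * sending message to <addr>, giving up" failure appears. */
    }

    MPI_Finalize();
    return 0;
}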

Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
issue I am having (i.e., the loop_spawn example not working)? If so, do you
know whether downgrading to, e.g., 1.10.1 or another version would avoid
it? Or is there another way to work around this bug until a new release is
out (or is one coming shortly that fixes it)?

Below is the output of the loop_spawn test on our university's cluster. I
know very little about the cluster's architecture, but the team that
manages it is very good, so I can get more information if that would be
helpful.

Thanks for your time.

Jason

mpiexec -np 5 loop_spawn
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
Child: launch
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #0 rank 4, size 6
parent: MPI_Comm_spawn #0 rank 0, size 6
parent: MPI_Comm_spawn #0 rank 2, size 6
parent: MPI_Comm_spawn #0 rank 3, size 6
parent: MPI_Comm_spawn #0 rank 1, size 6
Child 329941: exiting
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
Child: launch
parent: MPI_Comm_spawn #1 rank 0, size 6
parent: MPI_Comm_spawn #1 rank 2, size 6
parent: MPI_Comm_spawn #1 rank 1, size 6
parent: MPI_Comm_spawn #1 rank 3, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #1 rank 4, size 6
Child 329945: exiting
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
Child: launch
parent: MPI_Comm_spawn #2 rank 3, size 6
parent: MPI_Comm_spawn #2 rank 0, size 6
parent: MPI_Comm_spawn #2 rank 2, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #2 rank 1, size 6
parent: MPI_Comm_spawn #2 rank 4, size 6
Child 329949: exiting
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
Child: launch
[node:port?] too many retries sending message to <addr>, giving up
-------------------------------------------------------
Child job 5 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[...],0]
  Exit code:    255
--------------------------------------------------------------------------
