I dug into this a bit (with some help from others) and found that the spawn 
code appears to be working correctly - it is the test in orte/test that is 
wrong. The test has been correctly updated in the 2.x and master repos, but we 
failed to backport it to the 1.10 series. I have done so this morning, and it 
will be in the upcoming 1.10.3 release (out very soon).


> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the 
> problem.
> 
>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>> 
>> Hello,
>> 
>> I am using OpenMPI 1.10.2 compiled with Intel. I am trying to get the spawn 
>> functionality to work inside a for loop, but continue to get the error "too 
>> many retries sending message to <addr>, giving up" somewhere down the line 
>> in the for loop, seemingly because the processors are not being fully freed 
>> when disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c 
>> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c> 
>> example/test, and it has the exact same problem. I also found this mailing 
>> list post <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> 
>> from about a month and a half ago.
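
For reference, the spawn-in-a-loop pattern being discussed looks roughly like the
sketch below. This is only an illustration of the pattern (the parent repeatedly
calling MPI_Comm_spawn, merging the intercommunicator, and disconnecting), not the
actual orte/test/mpi/loop_spawn.c source; the child executable name "loop_child"
and the iteration count are placeholders.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm child, merged;
    int rank, size, i;

    MPI_Init(&argc, &argv);

    for (i = 0; i < 1000; i++) {
        /* Collective over MPI_COMM_WORLD: every parent rank participates,
           and rank 0 acts as the root that launches one child process. */
        MPI_Comm_spawn("loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

        /* Merge the parent/child intercommunicator into one intracommunicator. */
        MPI_Intercomm_merge(child, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: spawn #%d rank %d, size %d\n", i, rank, size);

        /* Release the communicators so the child's resources can be reclaimed
           before the next iteration; the report above suspects this cleanup
           is not completing, leading to the "too many retries" error. */
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}

The child side would typically call MPI_Init, MPI_Comm_get_parent,
MPI_Intercomm_merge, then MPI_Comm_disconnect on the parent communicator and
MPI_Finalize before exiting.
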
>> 
>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue 
>> I am having (i.e. the loop_spawn example not working)? If so, do you know if 
>> we can downgrade to e.g. 1.10.1 or another version? Or is there another way 
>> to work around this bug until you get a new release out (or is one coming 
>> shortly that fixes this)?
>> 
>> Below is the output of the loop_spawn test on our university's cluster. I 
>> know very little about the cluster's architecture, but I can get more 
>> information if it would be helpful; the team that manages the cluster is 
>> very good.
>> 
>> Thanks for your time.
>> 
>> Jason
>> 
>> mpiexec -np 5 loop_spawn
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> Child: launch
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #0 rank 4, size 6
>> parent: MPI_Comm_spawn #0 rank 0, size 6
>> parent: MPI_Comm_spawn #0 rank 2, size 6
>> parent: MPI_Comm_spawn #0 rank 3, size 6
>> parent: MPI_Comm_spawn #0 rank 1, size 6
>> Child 329941: exiting
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #1 rank 0, size 6
>> parent: MPI_Comm_spawn #1 rank 2, size 6
>> parent: MPI_Comm_spawn #1 rank 1, size 6
>> parent: MPI_Comm_spawn #1 rank 3, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #1 rank 4, size 6
>> Child 329945: exiting
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #2 rank 3, size 6
>> parent: MPI_Comm_spawn #2 rank 0, size 6
>> parent: MPI_Comm_spawn #2 rank 2, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #2 rank 1, size 6
>> parent: MPI_Comm_spawn #2 rank 4, size 6
>> Child 329949: exiting
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> Child: launch
>> [node:port?] too many retries sending message to <addr>, giving up
>> -------------------------------------------------------
>> Child job 5 terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero status, 
>> thus causing
>> the job to be terminated. The first process to do so was:
>> 
>>   Process name: [[...],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/06/29425.php
> 
