Thanks, Ralph, for all the help. I will do that until it gets fixed.

Nathan, I am very interested in getting this working because we are
developing some new code for research in materials science, and I believe
this is the last piece of the puzzle for us. I can use TCP for now, of
course. While I doubt I can help much, if you have trouble reproducing the
problem (or run into anything else), feel free to let me know. I understand
you probably have plenty of other things on your plate, but if there is
anything I can do to speed up the process, just let me know.
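(Concretely, I will just force the TCP BTL on the mpiexec command line for
now, e.g. for the loop_spawn test something like:

    mpiexec -np 5 -mca btl tcp,sm,self loop_spawn

until the openib path is fixed.)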

Lastly, is there a place (a website, issue tracker, etc.) where I can watch
for when the fix for this lands?

Thanks everyone!
Jason

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 4:51 PM, Ralph Castain <r...@open-mpi.org> wrote:

> You don’t want to use those options all the time, as your performance will
> take a hit; falling back from InfiniBand to TCP is not a good trade. Sadly,
> this is something we need someone like Nathan to address, as it is a bug in
> the code base, and in an area I’m not familiar with.
>
> For now, just use TCP so you can move forward.
>
>
> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Ralph, the problem *does* go away if I add "-mca btl tcp,sm,self" to the
> mpiexec command line. (By the way, I am using mpiexec rather than mpirun; do
> you recommend one over the other?) Can you tell me what this means for me?
> For example, should I always append these arguments to mpiexec for my
> non-test jobs as well? I do not know what you mean by fabric, unfortunately,
> but I can give you some system information (see the end of this email). I am
> not a system admin, so I do not have sudo rights. Just let me know if there
> is anything more specific I can tell you and I will get it.
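>
> If it helps, I could also try to query the InfiniBand adapters on a compute
> node directly. I believe something like the following would list them,
> assuming the standard InfiniBand diagnostic tools are installed there (I
> have not checked whether they are):
>
>     ibstat
>     ibv_devinfo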
>
> Nathan, thank you for your response. Unfortunately, I have no idea what
> that means :( I can forward it to our cluster managers, but I do not know
> whether it is enough information for them to understand what they might
> need to do to help me with this issue.
>
> $ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                20
> On-line CPU(s) list:   0-19
> Thread(s) per core:    1
> Core(s) per socket:    10
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 63
> Stepping:              2
> CPU MHz:               2594.159
> BogoMIPS:              5187.59
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>
> Thanks,
> Jason
>
> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>
>> That message is coming from udcm in the openib btl. It indicates some
>> sort of failure in the connection mechanism. It can happen if the listening
>> thread no longer exists or is taking too long to process messages.
>>
>> -Nathan
>>
>>
>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmm…I’m unable to replicate the problem on my machines. What fabric are you
>> using? Does the problem go away if you add “-mca btl tcp,sm,self” to the
>> mpirun command line?
>>
>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>> Hi Ralph et al.,
>>
>> Great, thank you for the help. I downloaded the loop_spawn test directly
>> from what I think is the master repo on GitHub:
>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>> I am still building against Open MPI 1.10.2, however.
>>
>> Is that the correct, up-to-date version of the test? If so, I am still
>> getting the same "too many retries sending message to 0x0184:0x00001d27,
>> giving up" errors. I also just downloaded the June 14 nightly tarball
>> (7.79 MB) from https://www.open-mpi.org/nightly/v2.x/ and I get the same
>> error.
>>
>> Could you please point me to the correct code?
>>
>> If you need me to provide more information please let me know.
>>
>> Thank you,
>> Jason
>>
>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I dug into this a bit (with some help from others) and found that the
>>> spawn code appears to be working correctly - it is the test in orte/test
>>> that is wrong. The test has been correctly updated in the 2.x and master
>>> repos, but we failed to backport it to the 1.10 series. I have done so this
>>> morning, and it will be in the upcoming 1.10.3 release (out very soon).
>>>
>>>
>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look
>>> at the problem.
>>>
>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Hello,
>>>
>>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying
>>> to get the spawn functionality to work inside a for loop, but I keep
>>> getting the error "too many retries sending message to <addr>, giving up"
>>> somewhere down the line in the loop, seemingly because resources are not
>>> being fully released when the spawned processes disconnect/finish. I found
>>> the orte/test/mpi/loop_spawn.c
>>> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c>
>>> example/test, and it has the exact same problem. I also found this
>>> <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing
>>> list post from about a month and a half ago.
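>>>
>>> For reference, my code follows essentially the same pattern as that test:
>>> spawn in a loop, merge, do some work, then disconnect. A stripped-down
>>> sketch of the parent side (not my actual research code; "./child" is just
>>> a placeholder for the spawned executable):
>>>
>>>   #include <mpi.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>       MPI_Init(&argc, &argv);
>>>       for (int i = 0; i < 1000; ++i) {
>>>           MPI_Comm inter, merged;
>>>           /* spawn one child process per iteration */
>>>           MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>                          0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
>>>           /* merge parent and child into a single intracommunicator */
>>>           MPI_Intercomm_merge(inter, 0, &merged);
>>>           /* ... do work over the merged communicator ... */
>>>           MPI_Comm_free(&merged);
>>>           MPI_Comm_disconnect(&inter);
>>>       }
>>>       MPI_Finalize();
>>>       return 0;
>>>   }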
>>>
>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>>> you know whether we can work around it by downgrading to, e.g., 1.10.1 or
>>> another version? Or is there another way to deal with this bug until you
>>> get a new release out (or is one coming shortly that fixes it)?
>>>
>>> Below is the output of the loop_spawn test on our university's cluster. I
>>> know very little about the cluster's architecture, but I can get more
>>> information if it would be helpful; the group that manages the cluster is
>>> very good.
>>>
>>> Thanks for your time.
>>>
>>> Jason
>>>
>>> mpiexec -np 5 loop_spawn
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> Child: launch
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>> Child 329941: exiting
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>> Child 329945: exiting
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>> Child 329949: exiting
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> Child: launch
>>> [node:port?] too many retries sending message to <addr>, giving up
>>> -------------------------------------------------------
>>> Child job 5 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status, 
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>>
>>>   Process name: [[...],0]
>>>   Exit code:    255
>>> --------------------------------------------------------------------------