Jason,

How many nodes are you running on?

Since you have an IB network, IB is used for intra-node communication
between tasks that are not part of the same Open MPI job (read: spawn group).
I can make a simple patch to use TCP instead of IB for this intra-node
communication.
Let me know if you are willing to give it a try.
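
(In the meantime, you can exclude the openib BTL entirely as a rough
approximation, e.g.

  mpiexec --mca btl ^openib -np 5 ./loop_spawn

but note this disables IB for all traffic, inter-node included, so it is
only a stopgap.)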

Cheers,

Gilles

On Wednesday, June 15, 2016, Jason Maldonis <maldo...@wisc.edu> wrote:

> Thanks Ralph for all the help. I will do that until it gets fixed.
>
> Nathan, I am very very interested in this working because we are
> developing some new cool code for research in materials science. This is
> the last piece of the puzzle for us I believe. I can use TCP for now though
> of course. While I doubt I can help, if you are having trouble reproducing
> the problem or something else, feel free to let me know. I understand you
> probably have a bunch of other things on your plate too, but if there is
> something I can do to speed up the process, just let me know.
>
> Lastly, what are the chances there is a place/website/etc where I can
> watch to see when the fix for this has been made?
>
> Thanks everyone!
> Jason
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jun 14, 2016 at 4:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> You don’t want to always use those options, as your performance will take
>> a hit - falling back from InfiniBand to TCP isn’t a good trade. Sadly, this
>> is something we need someone like Nathan to address, as it is a bug in the
>> code base, and in an area I’m not familiar with.
>>
>> For now, just use TCP so you can move forward.
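>>
>> For example (a sketch - substitute your own application for loop_spawn):
>>
>>   mpiexec -mca btl tcp,sm,self -np 5 ./loop_spawn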
>>
>>
>> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>
>> Ralph, the problem *does* go away if I add "-mca btl tcp,sm,self" to the
>> mpiexec command line. (By the way, I am using mpiexec rather than mpirun;
>> do you recommend one over the other?) Can you tell me what this means for
>> me? For example, should I always append these arguments to mpiexec for my
>> non-test jobs as well? I do not know what you mean by fabric,
>> unfortunately, but I can give you some system information (see the end of
>> this email). I am not a system admin, so I do not have sudo rights. Just
>> let me know if you need something more specific and I will get it.
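>>
>> (If it is equivalent, could I also set this via the environment instead of
>> appending it to every command line, e.g.
>>
>>   export OMPI_MCA_btl=tcp,sm,self
>>
>> in my job scripts? Please correct me if that form is wrong.)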
>>
>> Nathan, thank you for your response. Unfortunately I have no idea what
>> that means :( I can forward it to our cluster managers, but I do not know
>> if that is enough information for them to understand what they might need
>> to do to help me with this issue.
>>
>> $ lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                20
>> On-line CPU(s) list:   0-19
>> Thread(s) per core:    1
>> Core(s) per socket:    10
>> Socket(s):             2
>> NUMA node(s):          2
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 63
>> Stepping:              2
>> CPU MHz:               2594.159
>> BogoMIPS:              5187.59
>> Virtualization:        VT-x
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              25600K
>> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
>> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>>
>> Thanks,
>> Jason
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> maldo...@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>>
>>> That message is coming from udcm in the openib btl. It indicates some
>>> sort of failure in the connection mechanism. It can happen if the listening
>>> thread no longer exists or is taking too long to process messages.
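>>>
>>> (If you want to rule udcm out without dropping openib entirely, you can
>>> try a different connection manager, e.g.
>>>
>>>   mpiexec -mca btl_openib_cpc_include rdmacm -np 5 ./loop_spawn
>>>
>>> assuming rdmacm is available in your OFED stack.)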
>>>
>>> -Nathan
>>>
>>>
>>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hmm… I’m unable to replicate the problem on my machines. What fabric are
>>> you using? Does the problem go away if you add “-mca btl tcp,sm,self” to
>>> the mpirun cmd line?
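>>>
>>> (If you're not sure what fabric you have, something like
>>>
>>>   ompi_info | grep btl
>>>
>>> will list the BTL components in your build; openib in that list means it
>>> was built with InfiniBand support.)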
>>>
>>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>> Hi Ralph et al.,
>>>
>>> Great, thank you for the help. I downloaded the MPI loop_spawn test
>>> directly from what I think is the master repo on GitHub:
>>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>>> I am still using the MPI code from 1.10.2, however.
>>>
>>> Is that test updated with the correct code? If so, I am still getting
>>> the same "too many retries sending message to 0x0184:0x00001d27, giving up"
>>> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
>>> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
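>>>
>>> (For completeness, I am building and running the test in the obvious way,
>>> roughly:
>>>
>>>   mpicc loop_spawn.c -o loop_spawn
>>>   mpiexec -np 5 ./loop_spawn
>>>
>>> in case I am doing something wrong there.)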
>>>
>>> Could you please point me to the correct code?
>>>
>>> If you need me to provide more information please let me know.
>>>
>>> Thank you,
>>> Jason
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> maldo...@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> I dug into this a bit (with some help from others) and found that the
>>>> spawn code appears to be working correctly - it is the test in orte/test
>>>> that is wrong. The test has been correctly updated in the 2.x and master
>>>> repos, but we failed to backport it to the 1.10 series. I have done so this
>>>> morning, and it will be in the upcoming 1.10.3 release (out very soon).
>>>>
>>>>
>>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look
>>>> at the problem.
>>>>
>>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying
>>>> to get the spawn functionality to work inside a for loop, but I keep
>>>> getting the error "too many retries sending message to <addr>, giving up"
>>>> somewhere down the line in the for loop, seemingly because resources are
>>>> not being fully freed when the spawned processes disconnect/finish. I
>>>> found the orte/test/mpi/loop_spawn.c
>>>> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c>
>>>> example/test, and it has the exact same problem. I also found this
>>>> <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing
>>>> list post from about a month and a half ago.
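>>>>
>>>> For concreteness, the pattern I am describing boils down to something
>>>> like this (a minimal sketch rather than our actual code; the child
>>>> executable name is just a placeholder):
>>>>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     MPI_Init(&argc, &argv);
>>>>     for (int i = 0; i < 1000; i++) {
>>>>         MPI_Comm child;
>>>>         /* spawn one child per iteration */
>>>>         MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>>                        0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
>>>>         /* ... exchange data with the child here ... */
>>>>         /* disconnect should release the connection resources, but the
>>>>            error suggests something is not being freed */
>>>>         MPI_Comm_disconnect(&child);
>>>>     }
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }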
>>>>
>>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>>>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>>>> you know whether we could downgrade to e.g. 1.10.1 or another version? Or
>>>> is there another way to work around this bug until you get a new release
>>>> out (or is one coming shortly that fixes this, maybe)?
>>>>
>>>> Below is the output of the loop_spawn test on our university's cluster.
>>>> I know very little about its architecture, but I can get information if
>>>> that would be helpful; the large group of people who manage this cluster
>>>> are very good.
>>>>
>>>> Thanks for your time.
>>>>
>>>> Jason
>>>>
>>>> mpiexec -np 5 loop_spawn
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> Child: launch
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>>> Child 329941: exiting
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> Child: launch
>>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>>> Child 329945: exiting
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> Child: launch
>>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>>> Child 329949: exiting
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> Child: launch
>>>> [node:port?] too many retries sending message to <addr>, giving up
>>>> -------------------------------------------------------
>>>> Child job 5 terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpiexec detected that one or more processes exited with non-zero status, 
>>>> thus causing
>>>> the job to be terminated. The first process to do so was:
>>>>
>>>>   Process name: [[...],0]
>>>>   Exit code:    255
>>>> --------------------------------------------------------------------------
>>>>