Hi Gilles,

I would like to be able to run on anywhere from 1 to 16 nodes.

Let me briefly explain our MPI/parallelism situation for more context:
We have a "master" job that needs MPI functionality.  This master job is
written in Python (we use mpi4py), and it makes spawn calls out to other
MPI programs written in either C or Fortran.  The C/Fortran programs do
not need to communicate with the master Python MPI job (although it would
be nice to have the option at some point).  The master job runs the child
processes, collects their outputs, then reruns them with new input
parameters.  This continues for a long time in a big for-loop.
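
In case a concrete picture helps, here is a stripped-down sketch of what
that loop looks like (not our real code; the executable name
'./worker_exe', the process count, and the parameter handling are all
placeholders):

from mpi4py import MPI

N_STEPS = 3                  # placeholder; the real loop runs much longer
params = [1.0, 2.0]          # placeholder input parameters

for step in range(N_STEPS):
    # Spawn the C/Fortran MPI executable as its own job (in this sketch
    # each master rank spawns its own workers via COMM_SELF).
    child = MPI.COMM_SELF.Spawn('./worker_exe',
                                args=[str(p) for p in params],
                                maxprocs=8)

    # The children run without talking back to the parent; we read their
    # results from output files, then release the intercommunicator
    # before the next spawn (the children disconnect on their side too).
    child.Disconnect()

    # Pick new inputs based on the collected outputs (placeholder update).
    params = [p + 0.1 for p in params]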

So that's our setup, and that is the context in which I found this issue.

Unfortunately I am not the cluster admin for our university's cluster (at
UW Madison).  There are probably close to 100 people who use the cluster,
so I am guessing that the admins might be reluctant to install an MPI
library that might not be stable.  If you think I am misunderstanding what
you are asking, please let me know.  If the ETA on fixing this bug is going
to be closer to 6 months than 1 month, it might be useful to do what you
suggest, provided it gives a noticeable speedup.

I'll admit that I have already run into quite a few issues getting the MPI
within this code to work, so I am slightly hesitant to add more complexity
to it, and with it more potential avenues for hangs/crashes :(  I don't
understand the intricacies of MPI very well though, so if you don't think
this is an issue, then that is a big bonus!

Thanks!
Jason



Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Wed, Jun 15, 2016 at 5:34 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Jason,
>
> How many nodes are you running on ?
>
> Since you have an IB network, IB is used for intra-node communication
> between tasks that are not part of the same Open MPI job (read: spawn
> group). I can make a simple patch to use TCP instead of IB for this
> intra-node communication.
> Let me know if you are willing to give it a try.
>
> Cheers,
>
> Gilles
>
>
> On Wednesday, June 15, 2016, Jason Maldonis <maldo...@wisc.edu> wrote:
>
>> Thanks Ralph for all the help. I will do that until it gets fixed.
>>
>> Nathan, I am very very interested in this working because we are
>> developing some new cool code for research in materials science. This is
>> the last piece of the puzzle for us I believe. I can use TCP for now though
>> of course. While I doubt I can help, if you are having trouble reproducing
>> the problem or something else, feel free to let me know. I understand you
>> probably have a bunch of other things on your plate too, but if there is
>> something I can do to speed up the process, just let me know.
>>
>> Lastly, what are the chances there is a place/website/etc where I can
>> watch to see when the fix for this has been made?
>>
>> Thanks everyone!
>> Jason
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> maldo...@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jun 14, 2016 at 4:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> You don’t want to always use those options, as your performance will take
>>> a hit - trading InfiniBand for TCP isn’t a good option. Sadly, this is
>>> something we need someone like Nathan to address, as it is a bug in the
>>> code base and in an area I’m not familiar with.
>>>
>>> For now, just use TCP so you can move forward
>>>
>>>
>>> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Ralph, the problem *does* go away if I add "-mca btl tcp,sm,self" to
>>> the mpiexec cmd line. (By the way, I am using mpiexec rather than mpirun;
>>> do you recommend one over the other?) Can you tell me what this means for
>>> me? For example, should I always append these arguments to mpiexec for my
>>> non-test jobs as well?  Unfortunately I do not know what you mean by
>>> fabric, but I can give you some system information (see end of email).
>>> I am not a system admin, so I do not have sudo rights. Just let me know
>>> if I can tell you something more specific and I will get it.
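>>>
>>> (For concreteness, the test invocation with those options was along the
>>> lines of
>>>
>>>   mpiexec -mca btl tcp,sm,self -np 5 loop_spawn
>>>
>>> in case the exact command matters.)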
>>>
>>> Nathan,  Thank you for your response. Unfortunately I have no idea what
>>> that means :(  I can forward that to our cluster managers, but I do not
>>> know if that is enough information for them to understand what they might
>>> need to do to help me with this issue.
>>>
>>> $ lscpu
>>> Architecture:          x86_64
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                20
>>> On-line CPU(s) list:   0-19
>>> Thread(s) per core:    1
>>> Core(s) per socket:    10
>>> Socket(s):             2
>>> NUMA node(s):          2
>>> Vendor ID:             GenuineIntel
>>> CPU family:            6
>>> Model:                 63
>>> Stepping:              2
>>> CPU MHz:               2594.159
>>> BogoMIPS:              5187.59
>>> Virtualization:        VT-x
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              256K
>>> L3 cache:              25600K
>>> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
>>> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>>>
>>> Thanks,
>>> Jason
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> maldo...@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>>>
>>>> That message is coming from udcm in the openib btl. It indicates some
>>>> sort of failure in the connection mechanism. It can happen if the listening
>>>> thread no longer exists or is taking too long to process messages.
>>>>
>>>> -Nathan
>>>>
>>>>
>>>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmm…I’m unable to replicate a problem on my machines. What fabric are
>>>> you using? Does the problem go away if you add “-mca btl tcp,sm,self” to
>>>> the mpirun cmd line?
>>>>
>>>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>> Hi Ralph et al.,
>>>>
>>>> Great, thank you for the help. I downloaded the MPI loop_spawn test
>>>> directly from what I think is the master repo on GitHub:
>>>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>>>> I am still using the MPI library from 1.10.2, however.
>>>>
>>>> Is that test updated with the correct code? If so, I am still getting
>>>> the same "too many retries sending message to 0x0184:0x00001d27, giving up"
>>>> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
>>>> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>>>>
>>>> Could you please point me to the correct code?
>>>>
>>>> If you need me to provide more information please let me know.
>>>>
>>>> Thank you,
>>>> Jason
>>>>
>>>> Jason Maldonis
>>>> Research Assistant of Professor Paul Voyles
>>>> Materials Science Grad Student
>>>> University of Wisconsin, Madison
>>>> 1509 University Ave, Rm M142
>>>> Madison, WI 53706
>>>> maldo...@wisc.edu
>>>> 608-295-5532
>>>>
>>>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org>
>>>> wrote:
>>>>
>>>>> I dug into this a bit (with some help from others) and found that the
>>>>> spawn code appears to be working correctly - it is the test in orte/test
>>>>> that is wrong. The test has been correctly updated in the 2.x and master
>>>>> repos, but we failed to backport it to the 1.10 series. I have done so
>>>>> this morning, and it will be in the upcoming 1.10.3 release (out very
>>>>> soon).
>>>>>
>>>>>
>>>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look
>>>>> at the problem.
>>>>>
>>>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying
>>>>> to get the spawn functionality to work inside a for loop, but I keep
>>>>> getting the error "too many retries sending message to <addr>, giving up"
>>>>> somewhere down the line in the for loop, seemingly because the processors
>>>>> are not being fully freed when disconnecting/finishing. I found the
>>>>> orte/test/mpi/loop_spawn.c
>>>>> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c>
>>>>> example/test, and it has the exact same problem. I also found this
>>>>> <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing
>>>>> list post from about a month and a half ago.
>>>>>
>>>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the
>>>>> same issue I am having (i.e., the loop_spawn example not working)? If
>>>>> so, do you know whether we can downgrade to, e.g., 1.10.1 or another
>>>>> version? Or is there another way to work around this bug until you get
>>>>> a new release out (or is one coming shortly to fix this, maybe)?
>>>>>
>>>>> Below is the output of the loop_spawn test on our university's
>>>>> cluster. I know very little about the cluster's architecture, but I can
>>>>> get more information if it's helpful. The large group of people who
>>>>> manage this cluster are very good.
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> Jason
>>>>>
>>>>> mpiexec -np 5 loop_spawn
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> Child: launch
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>>>> Child 329941: exiting
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>>>> Child 329945: exiting
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>>>> Child 329949: exiting
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> Child: launch
>>>>> [node:port?] too many retries sending message to <addr>, giving up
>>>>> -------------------------------------------------------
>>>>> Child job 5 terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec detected that one or more processes exited with non-zero status, 
>>>>> thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>>
>>>>>   Process name: [[...],0]
>>>>>   Exit code:    255
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
