Jason,

How many nodes are you running on?

Since you have an IB network, IB is used for intra-node communication between tasks that are not part of the same Open MPI job (read: spawn group). I can make a simple patch to use TCP instead of IB for these intra-node communications. Let me know if you are willing to give it a try.

Cheers,

Gilles
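In the meantime, the workaround Ralph suggested below can be applied to the test run directly; with the invocation Jason used, that would be something like:

    mpiexec -mca btl tcp,sm,self -np 5 loop_spawn

i.e. selecting the tcp, sm, and self BTLs instead of openib, with the performance caveat Ralph notes.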
On Wednesday, June 15, 2016, Jason Maldonis <maldo...@wisc.edu> wrote:

> Thanks Ralph for all the help. I will do that until it gets fixed.
>
> Nathan, I am very, very interested in this working because we are developing some new cool code for research in materials science. This is the last piece of the puzzle for us, I believe. I can use TCP for now, of course. While I doubt I can help, if you are having trouble reproducing the problem or something else, feel free to let me know. I understand you probably have a bunch of other things on your plate too, but if there is something I can do to speed up the process, just let me know.
>
> Lastly, what are the chances there is a place/website/etc. where I can watch to see when the fix for this has been made?
>
> Thanks everyone!
> Jason
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jun 14, 2016 at 4:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> You don’t want to always use those options as your performance will take a hit - TCP vs InfiniBand isn’t a good option. Sadly, this is something we need someone like Nathan to address, as it is a bug in the code base, and in an area I’m not familiar with.
>>
>> For now, just use TCP so you can move forward.
>>
>> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>
>> Ralph, the problem *does* go away if I add "-mca btl tcp,sm,self" to the mpiexec cmd line. (By the way, I am using mpiexec rather than mpirun; do you recommend one over the other?) Will you tell me what this means for me? For example, should I always append these arguments to mpiexec for my non-test jobs as well? I do not know what you mean by fabric, unfortunately, but I can give you some system information (see end of email). Unfortunately I am not a system admin, so I do not have sudo rights. Just let me know if I can tell you something more specific, though, and I will get it.
>>
>> Nathan, thank you for your response. Unfortunately I have no idea what that means :( I can forward that to our cluster managers, but I do not know if that is enough information for them to understand what they might need to do to help me with this issue.
>>
>> $ lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                20
>> On-line CPU(s) list:   0-19
>> Thread(s) per core:    1
>> Core(s) per socket:    10
>> Socket(s):             2
>> NUMA node(s):          2
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 63
>> Stepping:              2
>> CPU MHz:               2594.159
>> BogoMIPS:              5187.59
>> Virtualization:        VT-x
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              25600K
>> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
>> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>>
>> Thanks,
>> Jason
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> maldo...@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>>
>>> That message is coming from udcm in the openib btl. It indicates some sort of failure in the connection mechanism. It can happen if the listening thread no longer exists or is taking too long to process messages.
>>>
>>> -Nathan
>>>
>>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hmm…I’m unable to replicate a problem on my machines. What fabric are you using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun cmd line?
>>>
>>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Hi Ralph, et al.,
>>>
>>> Great, thank you for the help. I downloaded the MPI loop_spawn test directly from what I think is the master repo on GitHub:
>>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>>> I am still using the MPI code from 1.10.2, however.
>>>
>>> Is that test updated with the correct code? If so, I am still getting the same "too many retries sending message to 0x0184:0x00001d27, giving up" errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>>>
>>> Could you please point me to the correct code?
>>>
>>> If you need me to provide more information, please let me know.
>>>
>>> Thank you,
>>> Jason
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> maldo...@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> I dug into this a bit (with some help from others) and found that the spawn code appears to be working correctly - it is the test in orte/test that is wrong. The test has been correctly updated in the 2.x and master repos, but we failed to backport it to the 1.10 series. I have done so this morning, and it will be in the upcoming 1.10.3 release (out very soon).
>>>>
>>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the problem.
>>>>
>>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the spawn functionality to work inside a for loop, but continue to get the error "too many retries sending message to <addr>, giving up" somewhere down the line in the for loop, seemingly because the processors are not being fully freed when disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c example/test (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c), and it has the exact same problem. I also found this mailing list post from about a month and a half ago: https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
>>>>
>>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue I am having (i.e. the loop_spawn example not working)? If so, do you know if we can downgrade to e.g. 1.10.1 or another version? Or is there another solution to fix this bug until you get a new release out (or is one coming shortly to fix this, maybe)?
>>>>
>>>> Below is the output of the loop_spawn test on our university's cluster, which I know very little about in terms of architecture, but I can get information if it's helpful. The large group of people who manage this cluster are very good.
>>>>
>>>> Thanks for your time.
>>>>
>>>> Jason
>>>>
>>>> mpiexec -np 5 loop_spawn
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent*******************************
>>>> parent: Launching MPI*
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> parent: MPI_Comm_spawn #0 return : 0
>>>> Child: launch
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>>> Child 329941: exiting
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> parent: MPI_Comm_spawn #1 return : 0
>>>> Child: launch
>>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>>> Child 329945: exiting
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> parent: MPI_Comm_spawn #2 return : 0
>>>> Child: launch
>>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>>> Child merged rank = 5, size = 6
>>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>>> Child 329949: exiting
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> parent: MPI_Comm_spawn #3 return : 0
>>>> Child: launch
>>>> [node:port?] too many retries sending message to <addr>, giving up
>>>> -------------------------------------------------------
>>>> Child job 5 terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpiexec detected that one or more processes exited with non-zero status, thus causing
>>>> the job to be terminated. The first process to do so was:
>>>>
>>>>   Process name: [[...],0]
>>>>   Exit code:    255
>>>> --------------------------------------------------------------------------
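P.S. For anyone trying to reproduce this, the pattern the loop_spawn test exercises is roughly the sketch below. This is not the actual orte/test/mpi/loop_spawn.c code; the binary name spawn_loop and the iteration count are placeholders, and the real test spawns a separate child executable.

/* spawn_loop.c - minimal sketch of MPI_Comm_spawn inside a for loop.
 * The same binary acts as parent and child; compile with mpicc and
 * run the parent side under mpiexec. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: each iteration creates and tears down a new child job. */
        for (int i = 0; i < 100; ++i) {
            MPI_Comm intercomm, merged;

            /* Every MPI_Comm_spawn starts a new job (spawn group). */
            MPI_Comm_spawn("./spawn_loop", MPI_ARGV_NULL, 1,
                           MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                           &intercomm, MPI_ERRCODES_IGNORE);

            /* Merge parents and child into one intracommunicator. */
            MPI_Intercomm_merge(intercomm, 0, &merged);

            /* Release both communicators so the child job can exit
             * cleanly before the next iteration spawns another one. */
            MPI_Comm_free(&merged);
            MPI_Comm_disconnect(&intercomm);

            printf("parent: iteration %d done\n", i);
        }
    } else {
        /* Child side: merge with the parent job, then disconnect and exit. */
        MPI_Comm merged;
        MPI_Intercomm_merge(parent, 1, &merged);
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

Each iteration is a fresh child job, so the parent-child traffic always crosses job (spawn group) boundaries; that is exactly the path where the openib/udcm connection setup Nathan describes comes into play, and why forcing the tcp BTL works around it.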