Try adding --novm to that cmd line - if all the nodes are topologically 
identical (minus the GPU), then this will speed things up.
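
For reference, combined with the interface setting that worked for you below, the 
full command line would presumably look something like:

  mpirun --novm --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x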


> On Sep 24, 2015, at 9:26 AM, Matt Thompson <fort...@gmail.com> wrote:
> 
> On Thu, Sep 24, 2015 at 12:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah, sorry - wrong param. It’s the out-of-band that is having the problem. Try 
> adding --mca oob_tcp_if_include <foo>
> 
> Ooh. Okay. Look at this:
> 
> (13) $ mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
> Process 1 of 2 is on r509i2n17 
> Process 0 of 2 is on r509i2n17 
> 
> So that is nice. Now the spin-up when I have 8 or so nodes is rather...slow. 
> But at this point I'll take working over efficient. Quick startup can come 
> later.
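> 
> I'm guessing I can also make this the default so I don't have to type it every 
> time, either via the environment:
> 
> export OMPI_MCA_oob_tcp_if_include=ib0
> 
> or (if I have the file name right) in ~/.openmpi/mca-params.conf:
> 
> oob_tcp_if_include = ib0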
> 
> Matt
> 
>  
> 
> 
>> On Sep 24, 2015, at 8:56 AM, Matt Thompson <fort...@gmail.com> wrote:
>> 
>> Ralph,
>> 
>> I believe these nodes might have both an Ethernet and an InfiniBand port, where 
>> the Ethernet port is not the one to use. Is there a way to tell Open MPI to 
>> ignore any Ethernet devices it sees? I've tried:
>> --mca btl sm,openib,self
>> and (based on the advice of the much more intelligent support at NAS):
>> --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1
>> But neither worked.
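>> 
>> I also wondered whether explicitly excluding the Ethernet side is the right 
>> approach, something like this (eth0 is just my guess at the interface name on 
>> these nodes):
>> 
>> --mca btl_tcp_if_exclude lo,eth0
>> 
>> but I wasn't sure that's the right knob.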
>> 
>> Matt
>> 
>> 
>> On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Starting in the 1.7 series, OMPI by default launches daemons on all nodes in 
>> the allocation during startup. This is done so we can “probe” the topology 
>> of the nodes and use that info during the process mapping procedure - e.g., 
>> if you want to map-by NUMA regions.
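>> 
>> For example, that probed topology is what lets a mapping directive like the 
>> following do its job (just an illustration, not something you need here):
>> 
>> mpirun --map-by numa -np 16 ./a.out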
>> 
>> What is happening here is that some of the nodes in your allocation aren’t 
>> allowing those daemons to call back to mpirun. Either a firewall is in the 
>> way, or something else is blocking the connection.
>> 
>> If you don’t want to launch on those other nodes, you could just add --novm 
>> to your cmd line, or use the --host option to restrict us to your local node. 
>> However, I imagine you got the bigger allocation so you could use it :-)
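>> 
>> For instance, something like this should keep the test on the node you are 
>> sitting on (if I'm reading your node names right):
>> 
>> mpirun --host r321i7n16 -np 2 ./helloWorld.x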
>> 
>> In which case, you need to remove the obstacle. You might check for a 
>> firewall, or check to see if multiple NICs are on the non-maia nodes (this 
>> can sometimes confuse things, especially if someone put the NICs on the same 
>> IP subnet).
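>> 
>> If it helps to see where things hang up, adding some OOB verbosity will show 
>> what the daemons are trying to connect back to - something like:
>> 
>> mpirun --mca oob_base_verbose 10 -np 2 ./helloWorld.x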
>> 
>> HTH
>> Ralph
>> 
>> 
>> 
>>> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
>>> 
>>> Open MPI Users,
>>> 
>>> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 
>>> using this configure string:
>>> 
>>>  ./configure --disable-vt --with-tm=/PBS --with-verbs 
>>> --disable-wrapper-rpath \
>>>     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
>>>     CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
>>>     --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee 
>>> configure.pgi15.7.log
>>> 
>>> It seemed to pass 'make check'. 
>>> 
>>> I'm working on Pleiades at NAS, where they have both Sandy Bridge nodes 
>>> with GPUs (maia) and regular Sandy Bridge compute nodes (hereafter called 
>>> Sandy) without them. To be extra careful (since PGI compiles to the architecture 
>>> you build on), I took a Westmere node and built Open MPI there just in case.
>>> 
>>> So, as I said, all seems to work with a test. I now grab a maia node, 
>>> maia1, from an allocation of 4 I had:
>>> 
>>> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>>> (103) $ mpirun -np 2 ./helloWorld.x
>>> Process 0 of 2 is on maia1 
>>> Process 1 of 2 is on maia1 
>>> 
>>> Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, from an 
>>> allocation of 8 I had:
>>> 
>>> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>>> (50) $ mpirun -np 2 ./helloWorld.x
>>> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> 
>>> Hmm. Let's try turning off TCP (often my first thought when on an 
>>> InfiniBand system):
>>> 
>>> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
>>> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 
>>> failed: Broken pipe (32)
>>> 
>>> Now, the nodes reporting the issue seem to be the "other" nodes on the 
>>> allocation that are in a different rack:
>>> 
>>> (52) $ cat $PBS_NODEFILE | uniq
>>> r321i7n16
>>> r321i7n17
>>> r323i5n6
>>> r323i5n7
>>> r323i5n8
>>> r323i5n9
>>> r323i5n10
>>> r323i5n11
>>> 
>>> Maybe that's a clue? I didn't think this would matter if I only ran two 
>>> processes...and it works on the multi-node maia allocation.
>>> 
>>> I've tried searching the web, but the only place I've found 
>>> tcp_peer_send_blocking mentioned is in a PDF that lists it as an error one 
>>> can see:
>>> 
>>> http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf
>>> 
>>> Any ideas for what this error can mean?
>>> 
>>> -- 
>>> Matt Thompson
>>> Man Among Men
>>> Fulcrum of History
>> 
>> 
>> 
>> 
>> -- 
>> Matt Thompson
>> Man Among Men
>> Fulcrum of History
> 
> 
> 
> 
> -- 
> Matt Thompson
> Man Among Men
> Fulcrum of History
