Try adding --novm to that cmd line - if all the nodes are topologically identical (minus the GPU), then this will speed things up.
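Something like this (just a sketch, assuming the oob_tcp_if_include ib0 setting that worked for you below still applies):

  mpirun --mca oob_tcp_if_include ib0 --novm -np 2 ./helloWorld.x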
> On Sep 24, 2015, at 9:26 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> On Thu, Sep 24, 2015 at 12:10 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah, sorry - wrong param. It's the out-of-band that is having the problem. Try adding --mca oob_tcp_if_include <foo>
>
> Ooh. Okay. Look at this:
>
> (13) $ mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
> Process 1 of 2 is on r509i2n17
> Process 0 of 2 is on r509i2n17
>
> So that is nice. Now the spin-up if I have 8 or so nodes is rather... slow. But at this point I'll take working over efficient. Quick startup can come later.
>
> Matt
>
>> On Sep 24, 2015, at 8:56 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> Ralph,
>>
>> I believe these nodes might have both an Ethernet and an InfiniBand port, where the Ethernet port is not the one to use. Is there a way to tell Open MPI to ignore any Ethernet devices it sees? I've tried:
>>
>>   --mca btl sm,openib,self
>>
>> and (based on the advice of the much more intelligent support at NAS):
>>
>>   --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1
>>
>> But neither worked.
>>
>> Matt
>>
>> On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Starting in the 1.7 series, OMPI by default launches daemons on all nodes in the allocation during startup. This is done so we can "probe" the topology of the nodes and use that info during the process mapping procedure - e.g., if you want to map-by NUMA regions.
>>
>> What is happening here is that some of the nodes in your allocation aren't allowing those daemons to call back to mpirun. Either a firewall is in the way, or something is preventing it.
>>
>> If you don't want to launch on those other nodes, you could just add --novm to your cmd line, or use the --host option to restrict us to your local node. However, I imagine you got the bigger allocation so you could use it :-)
>>
>> In which case, you need to remove the obstacle. You might check for a firewall, or check to see if multiple NICs are on the non-maia nodes (this can sometimes confuse things, especially if someone put the NICs on the same IP subnet).
>>
>> HTH
>> Ralph
>>
>>> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
>>>
>>> Open MPI Users,
>>>
>>> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 using this configure string:
>>>
>>> ./configure --disable-vt --with-tm=/PBS --with-verbs --disable-wrapper-rpath \
>>>     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
>>>     CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
>>>     --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee configure.pgi15.7.log
>>>
>>> It seemed to pass 'make check'.
>>>
>>> I'm working on Pleiades at NAS, which has both Sandy Bridge nodes with GPUs (maia) and regular Sandy Bridge compute nodes (hereafter called Sandy) without. To be extra careful (since PGI compiles to the architecture you build on), I took a Westmere node and built Open MPI there just in case.
>>>
>>> So, as I said, all seems to work with a test. I now grab a maia node, maia1, of an allocation of 4 I had:
>>>
>>> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>>> (103) $ mpirun -np 2 ./helloWorld.x
>>> Process 0 of 2 is on maia1
>>> Process 1 of 2 is on maia1
>>>
>>> Good.
>>> Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, of an allocation of 8 I had:
>>>
>>> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>>> (50) $ mpirun -np 2 ./helloWorld.x
>>> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>>
>>> Hmm. Let's try turning off tcp (often my first thought when on an Infiniband system):
>>>
>>> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
>>> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>> [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
>>>
>>> Now, the nodes reporting the issue seem to be the "other" nodes on the allocation that are in a different rack:
>>>
>>> (52) $ cat $PBS_NODEFILE | uniq
>>> r321i7n16
>>> r321i7n17
>>> r323i5n6
>>> r323i5n7
>>> r323i5n8
>>> r323i5n9
>>> r323i5n10
>>> r323i5n11
>>>
>>> Maybe that's a clue? I didn't think this would matter if I only ran two processes... and it works on the multi-node maia allocation.
>>>
>>> I've tried searching the web, but the only place I've seen tcp_peer_send_blocking is in a PDF where they say it's an error that can be seen:
>>>
>>> http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf
>>>
>>> Any ideas for what this error can mean?
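FWIW, the helloWorld.c itself isn't shown in the thread, but judging from the output it is presumably a minimal MPI hello world along these lines (just a sketch/assumption, not the actual source):

#include <stdio.h>
#include <mpi.h>

/* Prints "Process <rank> of <size> is on <hostname>", matching the output
 * shown above. */
int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    printf("Process %d of %d is on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}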