Another thing you can do is (a) ensure you built with --enable-debug, and then (b) run it with -mca oob_base_verbose 100 (without the tcp_if_include option) so we can watch the connection handshake and see what it is doing. The --hetero-nodes option will have no effect here and can be ignored.
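For example (untested here; the install prefix and the hostfile name are just placeholders), something along these lines should capture the handshake:

./configure --enable-debug --prefix=/opt/openmpi-1.8.3 && make && make install    # rebuild with debug enabled
mpiexec -mca oob_base_verbose 100 -n 4 --hostfile machines ./mpihello 2>&1 | tee oob.log

and oob.log will then show which addresses mpirun and the orteds try to connect to, and where they stall.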
Ralph

> On Nov 10, 2014, at 5:12 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Hi,
>
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use all the published interfaces.
>
> by any chance, are you running a firewall on your head node ?
> one possible explanation is the compute node tries to access the public interface of the head node, and packets get dropped by the firewall.
>
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules */
>
> another possible explanation is the compute node is doing (reverse) dns requests with the public name and/or ip of the head node and that takes some time to complete (success or failure, this does not really matter here)
>
> /* a simple test is to make sure all the hosts/ip of the head node are in the /etc/hosts of the compute node */
>
> could you check your network config (firewall and dns) ?
>
> can you reproduce the delay when running mpirun on the head node and with one mpi task on the compute node ?
>
> if yes, then the hard way to trace the delay issue would be to strace -ttt both orted and the mpi task that are launched on the compute node and see where the time is lost.
> /* at this stage, i would suspect orted ... */
>
> Cheers,
>
> Gilles
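>
> for example (assuming an iptables based firewall on the head node; the subnet and the pid are placeholders), you could check and temporarily open it with :
>
> iptables -nvL                                      # list the rules and their packet/drop counters
> iptables -I INPUT -s 192.168.154.0/26 -j ACCEPT    # let everything from the cluster subnet through for the test
>
> and trace the daemon on the compute node with :
>
> strace -ttt -f -p <pid of orted> -o orted.strace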
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 10.11.2014 at 16:39, Ralph Castain wrote:
>
> > That is indeed bizarre - we haven't heard of anything similar from other users. What is your network configuration? If you use oob_tcp_if_include or exclude, can you resolve the problem?
>
> Thx - this option helped to get it working.
>
> These tests were made for the sake of simplicity between the headnode of the cluster and one (idle) compute node. I tried then between the (identical) compute nodes and this worked fine. The headnode of the cluster and the compute node are slightly different though (i.e. number of cores), and use eth1 and eth0 respectively for the internal network of the cluster.
>
> I tried --hetero-nodes with no change.
>
> Then I turned to:
>
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
>
> and the application started instantly. On another cluster, where the headnode is identical to the compute nodes but with the same network setup as above, I observed a delay of "only" 30 seconds. Nevertheless, also on this cluster the working addition was the correct "oob_tcp_if_include" to solve the issue.
>
> The questions which remain: a) is this the intended behavior, b) what changed in this scope between 1.8.1 and 1.8.2?
>
> -- Reuti
>
>
> >> On Nov 10, 2014, at 4:50 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>
> >> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> >>
> >>> Wow, that's pretty terrible! :(
> >>>
> >>> Is the behavior BTL-specific, perchance? E.g., if you only use certain BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP-interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
> >>>> On Nov 10, 2014, at 6:42 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>>>
> >>>>> On 10.11.2014 at 12:24, Reuti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>> On 09.11.2014 at 05:38, Ralph Castain wrote:
> >>>>>>
> >>>>>> FWIW: during MPI_Init, each process "publishes" all of its interfaces. Each process receives a complete map of that info for every process in the job. So when the TCP btl sets itself up, it attempts to connect across -all- the interfaces published by the other end.
> >>>>>>
> >>>>>> So it doesn't matter what hostname is provided by the RM. We discover and "share" all of the interface info for every node, and then use them for load balancing.
> >>>>>
> >>>>> Does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there was a delay when the application started with my first compilation of 1.8.3, I disregarded even all my extra options and ran it outside of any queuing system - the delay remains - on two different clusters.
> >>>>
> >>>> I forgot to mention: the delay is more or less exactly 2 minutes from the time I issued `mpiexec` until the `mpihello` starts up (there is no delay for the initial `ssh` to reach the other node though).
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> I tracked it down: up to 1.8.1 it is working fine, but 1.8.2 already creates this delay when starting up a simple mpihello. I assume it may lie in the way other machines are reached, as with one single machine there is no delay. But using one (and only one - no tree spawn involved) additional machine already triggers this delay.
> >>>>>
> >>>>> Did anyone else notice it?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>> HTH
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 8, 2014, at 8:13 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>
> >>>>>>> Ok I figured, I'm going to have to read some more for my own curiosity. The reason I mention the Resource Manager we use, and that the hostnames given by PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would take to get to a peer node when the node list given all matches the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.
> >>>>>>>
> >>>>>>> I'll go do some measurements and see.
> >>>>>>>
> >>>>>>> Brock Palen
> >>>>>>> www.umich.edu/~brockp
> >>>>>>> CAEN Advanced Computing
> >>>>>>> XSEDE Campus Champion
> >>>>>>> bro...@umich.edu
> >>>>>>> (734) 936-1985
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>>>>>>
> >>>>>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.
> >>>>>>>>
> >>>>>>>> This short FAQ has links to 2 other FAQs that provide detailed information about reachability:
> >>>>>>>>
> >>>>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> >>>>>>>>
> >>>>>>>> The usNIC BTL uses UDP for its wire transport and actually does a much more standards-conformant peer reachability determination (i.e., it actually checks routing tables to see if it can reach a given peer, which has all kinds of caching benefits, kernel controls if you want them, etc.). We haven't back-ported this to the TCP BTL because a) most people who use TCP for MPI still use a single L2 address space, and b) no one has asked for it. :-)
> >>>>>>>>
> >>>>>>>> As for the round-robin scheduling, there's no indication from the Linux TCP stack what the bandwidth is on a given IP interface. So unless you use the btl_tcp_bandwidth_<IP_INTERFACE_NAME> (e.g., btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them equally.
> >>>>>>>>
> >>>>>>>> If you have multiple IP interfaces sharing a single physical link, there will likely be no benefit from having Open MPI use more than one of them. You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select just one.
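> >>>>>>>>
> >>>>>>>> For example (the interface names and hostfile here are just illustrations taken from this thread, not a recommendation for your exact setup), either of
> >>>>>>>>
> >>>>>>>>   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -n 4 --hostfile machines ./mpihello
> >>>>>>>>   mpirun --mca btl tcp,self --mca btl_tcp_if_exclude lo,ib0,eoib0 -n 4 --hostfile machines ./mpihello
> >>>>>>>>
> >>>>>>>> restricts the TCP BTL to the 1GigE interface (note that setting btl_tcp_if_exclude replaces the default exclude list, so keep lo in it).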
> >>>>>>>>> On Nov 7, 2014, at 2:53 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> I was doing a test on our IB based cluster, where I was disabling IB with
> >>>>>>>>>
> >>>>>>>>> --mca btl ^openib --mca mtl ^mxm
> >>>>>>>>>
> >>>>>>>>> I was sending very large messages >1GB and I was surprised by the speed.
> >>>>>>>>>
> >>>>>>>>> I noticed then that of all our ethernet interfaces
> >>>>>>>>>
> >>>>>>>>> eth0 (1gig-e)
> >>>>>>>>> ib0 (ip over ib, for lustre configuration at vendor request)
> >>>>>>>>> eoib0 (ethernet over IB interface for IB -> Ethernet gateway for some external storage support at >1Gig speed)
> >>>>>>>>>
> >>>>>>>>> I saw all three were getting traffic.
> >>>>>>>>>
> >>>>>>>>> We use torque for our Resource Manager and use TM support; the hostnames given by torque match the eth0 interfaces.
> >>>>>>>>>
> >>>>>>>>> How does OMPI figure out that it can also talk over the others? How does it choose to load balance?
> >>>>>>>>>
> >>>>>>>>> BTW that is fine, but we will use if_exclude on one of the IB ones, as ib0 and eoib0 are the same physical device and may screw with load balancing if anyone ever falls back to TCP.
> >>>>>>>>>
> >>>>>>>>> Brock Palen
> >>>>>>>>> www.umich.edu/~brockp
> >>>>>>>>> CAEN Advanced Computing
> >>>>>>>>> XSEDE Campus Champion
> >>>>>>>>> bro...@umich.edu
> >>>>>>>>> (734) 936-1985
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jeff Squyres
> >>>>>>>> jsquy...@cisco.com
> >>>>>>>> For corporate legal information go to:
> >>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/