Whatever the original choice(s) of the BTL are, an interface should disqualify 
itself after a few missed connections (based on the retry MCA parameter). 
However, to get anything sensible in this configuration you should change the 
default timeout to a reasonable value (30 seconds?).

While this approach adds overhead for short-running applications, for larger 
runs it should provide a decent solution.
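
As a rough illustration of that idea (not Open MPI's actual code), the sketch 
below retries each candidate peer address a fixed number of times with a 
connection timeout and drops the ones that never answer. The function names, 
the default values, and the 12.0.0.2 address in the usage note are hypothetical; 
only 23.0.0.2 comes from the log quoted below.

```python
import socket


def probe(address, port, timeout):
    """Return True if a TCP connection to (address, port) succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, "Network is unreachable", connection refused, etc.
        return False


def usable_addresses(candidates, port, retries=3, timeout=30.0, connect=probe):
    """Keep the candidate addresses that answer within `retries` attempts.

    `connect` is injectable so the retry logic can be exercised without a
    real network.
    """
    usable = []
    for address in candidates:
        for _ in range(retries):
            if connect(address, port, timeout):
                usable.append(address)
                break
        # An address that failed every attempt is dropped (disqualified).
    return usable
```

With these defaults an unreachable address can cost up to retries * timeout 
seconds before it is dropped, which is the overhead a short run would notice. 
A call such as usable_addresses(["23.0.0.2", "12.0.0.2"], 1024) returns only 
the addresses that accepted a connection.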

  George.

> On Sep 18, 2015, at 19:26 , Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Jeff,
> 
> I built a similar environment with master and private IPs, and that does not 
> work.
> my understanding is that each task has two tcp btls (one per interface),
> and there is currently no mechanism to tell that a node is unreachable
> via a given btl.
> (a btl picks the "best" interface for each node, but it never picks zero 
> interfaces)
> 
> in order to support this, we should add extra checks to ensure the best 
> interface is reachable
> (that could be achieved "statically" by parsing the routing tables, or 
> "dynamically" by "pinging" the remote interface)
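
A minimal sketch of what a "static" check could look like, assuming the 
simplest case where a peer is reachable only if it sits on a directly connected 
subnet (true for a switchless ring, not in general once gateways are involved). 
This is not how Open MPI or its reachable framework does it; the function name, 
the 12.0.0.1 and 13.0.0.1 addresses, and the /24 prefixes are hypothetical.

```python
import ipaddress


def directly_reachable(local_interfaces, remote_address):
    """Return the local interfaces whose subnet contains remote_address.

    local_interfaces: strings in "address/prefix" form, e.g. "12.0.0.1/24".
    An empty result means no directly connected link leads to the peer.
    """
    remote = ipaddress.ip_address(remote_address)
    return [
        iface
        for iface in local_interfaces
        if remote in ipaddress.ip_interface(iface).network
    ]
```

For a node configured with ["12.0.0.1/24", "13.0.0.1/24"], a peer address of 
23.0.0.2 yields an empty list, so that address would be skipped instead of 
being chosen as the "best" one.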
> 
> On master, there is a reachable framework. Could/should the tcp btl simply 
> use it?
> 
> Cheers,
> 
> Gilles
> 
> On Saturday, September 19, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> Open MPI uses different heuristics depending on whether IP addresses are 
> public or private.
> 
> All your IP addresses are technically "public" -- they're not in 10.x.x.x or 
> 192.168.x.x, for example.
> 
> So Open MPI assumes that they are all routable to each other.
> 
> You might want to change your 3 networks to be 10.1.x.x/16, 10.2.x.x/16, and 
> 10.3.x.x/16.  See if that makes it work.
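
To see which side of the public/private split an address falls on, the sketch 
below tests it against the three RFC 1918 private ranges. Whether Open MPI's 
heuristic uses exactly these ranges is an assumption here; the helper name is 
hypothetical, and 23.0.0.2 is the address from the log further down.

```python
import ipaddress

# The three IPv4 private-use blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the IPv4 address lies in an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)
```

Here is_rfc1918("23.0.0.2") is False, while an address renumbered into 
10.1.x.x, such as 10.1.0.1, gives True.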
> 
> 
> > On Sep 17, 2015, at 12:31 PM, Shang Li <shawn.li.x...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I wanted to set up a 3-node ring network in which each node connects to the 
> > other 2 directly through 2 Ethernet ports, without a switch/router.
> >
> > The interface configurations can be found in the following picture.
> >
> > https://www.dropbox.com/s/g75i51rrjs51b21/mpi-graph%20-%20New%20Page.png?dl=0
> >
> > I've used ifconfig on each node to configure each port, and made sure I can 
> > ssh from each node to the other 2 nodes.
> >
> > But a simple ring_c example doesn't work... So I turned on --mca 
> > btl_base_verbose 30 and could see that node1 was trying to use 23.0.0.2 
> > (the link between node2 and node3) to get to node2, even though there is a 
> > direct link to node2.
> >
> > The output log is like:
> >
> > [node1:01828] btl: tcp: attempting to connect() to [[19529,1],1] address 
> > 23.0.0.2 on port 1024
> > [[19529,1],0][btl_tcp_endpoint.c:606:mca_btl_tcp_endpoint_start_connect] 
> > from node1 to: node2 Unable to connect to the peer 23.0.0.2  on port 4: 
> > Network is unreachable
> >
> > I've read the following posts and FAQs but still couldn't understand this 
> > kind of behavior.
> >
> > http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3
> > http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> > http://www.open-mpi.org/community/lists/users/2014/11/25810.php
> >
> >
> > Any pointers would be appreciated! Thanks in advance!
> >
> > My open-mpi info:
> >
> >  Package: Open MPI gtbldadm@ubuntu-12 Distribution
> >                 Open MPI: 1.0.0.22
> >   Open MPI repo revision: git714842d
> >    Open MPI release date: May 27, 2015
> >                 Open RTE: 1.0.0.22
> >   Open RTE repo revision: git714842d
> >    Open RTE release date: May 27, 2015
> >                     OPAL: 1.0.0.22
> >       OPAL repo revision: git714842d
> >        OPAL release date: May 27, 2015
> >                  MPI API: 2.1
> >
> >
> > Best,
> > Shawn
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2015/09/27612.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
