Jeff,

I built a similar environment with master and private IP addresses, and that does not work. My understanding is that each task has two tcp btls (one per interface), and there is currently no mechanism to tell that a node is unreachable via a given btl. (A btl picks the "best" interface for each node, but it never picks zero interfaces.)
In order to support this, we should add extra checks to ensure the best interface is reachable. That could be achieved "statically" by parsing the routing tables, or "dynamically" by "pinging" the remote interface.

On master, there is a reachable framework. Could/should the tcp btl simply use it?

Cheers,

Gilles

On Saturday, September 19, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Open MPI uses different heuristics depending on whether IP addresses are public or private.
>
> All your IP addresses are technically "public" -- they're not in 10.x.x.x or 192.168.x.x, for example.
>
> So Open MPI assumes that they are all routable to each other.
>
> You might want to change your 3 networks to be 10.1.x.x/16, 10.2.x.x/16, and 10.3.x.x/16. See if that makes it work.
>
> On Sep 17, 2015, at 12:31 PM, Shang Li <shawn.li.x...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I wanted to setup a 3-node ring network, each connects to the other 2 using 2 Ethernet ports directly without a switch/router.
> >
> > The interface configurations could be found in the following picture.
> >
> > https://www.dropbox.com/s/g75i51rrjs51b21/mpi-graph%20-%20New%20Page.png?dl=0
> >
> > I've used ifconfig on each node to configure each port, and made sure I can ssh from each node to the other 2 nodes.
> >
> > But a simple ring_c example doesn't work... So I turn on --mca btl_base_verbose 30, I could see that node1 was trying to use 23.0.0.2 (link between node2 and 3) to get to node2 though there is a direct link to node 2.
> >
> > The output log is like:
> >
> > [node1:01828] btl: tcp: attempting to connect() to [[19529,1],1] address 23.0.0.2 on port 1024
> > [[19529,1],0][btl_tcp_endpoint.c:606:mca_btl_tcp_endpoint_start_connect] from node1 to: node2 Unable to connect to the peer 23.0.0.2 on port 4: Network is unreachable
> >
> > I've read the following posts and FAQs but still couldn't understand this kind of behavior.
> >
> > http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3
> > http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> > http://www.open-mpi.org/community/lists/users/2014/11/25810.php
> >
> > Any pointers would be appreciated! Thanks in advance!
> >
> > My open-mpi info:
> >
> > Package: Open MPI gtbldadm@ubuntu-12 Distribution
> > Open MPI: 1.0.0.22
> > Open MPI repo revision: git714842d
> > Open MPI release date: May 27, 2015
> > Open RTE: 1.0.0.22
> > Open RTE repo revision: git714842d
> > Open RTE release date: May 27, 2015
> > OPAL: 1.0.0.22
> > OPAL repo revision: git714842d
> > OPAL release date: May 27, 2015
> > MPI API: 2.1
> >
> > Best,
> > Shawn
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/09/27612.php
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/09/27626.php
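
P.S. To make the "static" check idea a bit more concrete, here is a rough Python sketch. It is not Open MPI code and does not use the reachable framework. It only models the idea of dropping peer addresses that are not on a directly attached subnet, so that the result can legitimately be "zero interfaces". A real check would have to consult the routing tables, which this sketch ignores. The only address taken from the thread is 23.0.0.2; the 12.0.0.x and 13.0.0.x addresses, the /24 prefix lengths, and the function names are assumptions made up for the illustration.

```python
import ipaddress

# RFC 1918 private ranges, used only to illustrate the public/private distinction.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_private(addr):
    """Return True if addr lies in one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def reachable_peer_addrs(local_ifaces, peer_addrs):
    """Keep only the peer addresses on a subnet directly attached to this node.

    local_ifaces: local addresses in CIDR form, e.g. "12.0.0.1/24".
    peer_addrs:   addresses advertised by the peer.
    An empty result means no peer address is directly reachable.
    """
    local_nets = [ipaddress.ip_interface(i).network for i in local_ifaces]
    return [a for a in peer_addrs
            if any(ipaddress.ip_address(a) in net for net in local_nets)]


# Hypothetical view from node1 (only 23.0.0.2 appears in the original log).
node1_ifaces = ["12.0.0.1/24", "13.0.0.1/24"]  # assumed links to node2 and node3
node2_addrs = ["12.0.0.2", "23.0.0.2"]         # assumed direct link + node2-node3 link

print(reachable_peer_addrs(node1_ifaces, node2_addrs))
print(reachable_peer_addrs(node1_ifaces, ["23.0.0.2"]))
print(is_private("23.0.0.2"), is_private("10.1.0.1"))
```

With these assumed addresses, the first call keeps only the address on the direct node1-node2 link, and the second returns an empty list, which is the "zero interface" case that cannot be expressed today.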