On 13.11.2014 at 00:55, Gilles Gouaillardet wrote:

> Could you please send the output of netstat -nr on both head and compute
> node?

Head node:

annemarie:~ # netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         137.248.x.y     0.0.0.0         UG        0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
137.248.x.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
192.168.151.80  0.0.0.0         255.255.255.255 UH        0 0          0 eth1
192.168.154.0   0.0.0.0         255.255.255.192 U         0 0          0 eth1
192.168.154.128 0.0.0.0         255.255.255.192 U         0 0          0 eth3

Compute node, with the (wrong) default entry for the non-existent gateway:

node28:~ # netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         192.168.154.60  0.0.0.0         UG        0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
192.168.154.0   0.0.0.0         255.255.255.192 U         0 0          0 eth0
192.168.154.64  0.0.0.0         255.255.255.192 U         0 0          0 eth1

As I said: when I remove the "default" entry for the gateway, it starts up
instantly.

-- Reuti

> No problem obfuscating the IP of the head node, I am only interested in
> netmasks and routes.
>
> Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> On Nov 12, 2014, at 2:45 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>> On 12.11.2014 at 17:27, Reuti wrote:
>>>
>>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>>
>>>>> Another thing you can do is (a) ensure you built with --enable-debug, and
>>>>> then (b) run it with -mca oob_base_verbose 100 (without the
>>>>> tcp_if_include option) so we can watch the connection handshake and see
>>>>> what it is doing. The --hetero-nodes will have no effect here and can be
>>>>> ignored.
>>>>
>>>> Done. It really tries to connect to the outside interface of the headnode.
>>>> But whether there is a firewall or not: the nodes have no clue how to reach
>>>> 137.248.0.0 - they have no gateway to this network at all.
>>>
>>> I have to revert this. They think that there is a gateway although there
>>> isn't one. When I remove the gateway entry from the routing table by hand,
>>> it starts up instantly too.
>>>
>>> While I can do this on my own cluster, I still have the 30-second delay on
>>> a cluster where I'm not root, though this may be because of the firewall
>>> there. The gateway on this cluster does indeed lead to the outside world.
>>>
>>> Personally I find this behavior of using all interfaces a little bit too
>>> aggressive. If you don't check this carefully beforehand and start a
>>> long-running application, you might not even notice the delay during
>>> startup.
>>
>> Agreed - do you have any suggestions on how we should choose the order in
>> which to try them? I haven't been able to come up with anything yet. Jeff
>> has some fancy algo in his usnic BTL that we are going to discuss after SC
>> that I'm hoping will help, but I'd be open to doing something better in the
>> interim for 1.8.4.
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> It tries this regardless of whether the internal or the external name of
>>>> the headnode is given in the machinefile - I hit ^C then. I attached the
>>>> output of Open MPI 1.8.1 for this setup too.
>>>>
>>>> -- Reuti
>>>>
>>>> <openmpi1.8.3.txt><openmpi1.8.1.txt>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2014/11/25777.php
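
A note for readers of the archive: the effect of the "default" entry on
node28 can be illustrated with a small route lookup. The sketch below is not
what Open MPI does internally; it only mimics the kernel's longest-prefix
match against the node28 table posted in this message. The function name
lookup and the target address 203.0.113.10 are assumptions for illustration
(the real external address of the head node is obfuscated in this thread, so
a documentation address stands in for it).

```python
import ipaddress

# node28 routing table as posted in this message:
# (destination, gateway, genmask, interface)
NODE28_ROUTES = [
    ("0.0.0.0", "192.168.154.60", "0.0.0.0", "eth0"),
    ("127.0.0.0", "0.0.0.0", "255.0.0.0", "lo"),
    ("169.254.0.0", "0.0.0.0", "255.255.0.0", "eth0"),
    ("192.168.154.0", "0.0.0.0", "255.255.255.192", "eth0"),
    ("192.168.154.64", "0.0.0.0", "255.255.255.192", "eth1"),
]


def lookup(routes, target):
    """Return (network, gateway, iface) of the most specific matching
    route, or None if no route covers the target address."""
    addr = ipaddress.ip_address(target)
    best = None
    for dest, gateway, mask, iface in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}")
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    return best


# Hypothetical stand-in for the head node's external address.
OUTSIDE = "203.0.113.10"

# With the default entry, the outside address matches 0.0.0.0/0 and is
# handed to the (non-existent) gateway 192.168.154.60.
print(lookup(NODE28_ROUTES, OUTSIDE))

# With the default entry removed, no route matches at all.
without_default = [r for r in NODE28_ROUTES if r[0] != "0.0.0.0"]
print(lookup(without_default, OUTSIDE))
```

If the lookup finds no route, a connection attempt can typically fail right
away with "network unreachable"; if it finds the default entry, the attempt is
sent towards the gateway and has to wait for a timeout. That would be
consistent with the instant startup seen after removing the entry, but it is
an interpretation, not something confirmed by the debug output.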