netstat doesn't show a loopback route even on the head node, while ifconfig shows the loopback interface up and running on the compute nodes as well as on the master node.
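(One note, in case it is relevant: on Linux the loopback route normally lives in the kernel's "local" routing table rather than in the main table, so its absence from netstat -nr output does not by itself mean lo is down. Assuming iproute2 is installed, the loopback entries should be visible with:

[root@pmd ~]# ip route show table local

The main-table routes on the head node and the lo interface on a compute node are pasted below.)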
[root@pmd ~]# netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.3.0     0.0.0.0         255.255.255.0   U         0 0          0 eth1
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         192.168.3.1     0.0.0.0         UG        0 0          0 eth1

[root@compute-01-01 ~]# ifconfig
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:880 errors:0 dropped:0 overruns:0 frame:0
          TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:150329 (146.8 KiB)  TX bytes:150329 (146.8 KiB)

On Thu, Nov 13, 2014 at 1:02 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> but it is running on your head node, isn't it?
>
> you might want to double check why there is no loopback interface on
> your compute nodes.
> in the meantime, you can disable the lo and ib0 interfaces
>
> Cheers,
>
> Gilles
>
> On 2014/11/13 16:59, Syed Ahsan Ali wrote:
>> I don't see it running:
>>
>> [pmdtest@compute-01-01 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
>> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
>> [pmdtest@compute-01-01 ~]$ exit
>> logout
>> Connection to compute-01-01 closed.
>>
>> [pmdtest@pmd ~]$ ssh compute-01-06
>> Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
>> [pmdtest@compute-01-06 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
>> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
>> [pmdtest@compute-01-06 ~]$
>>
>> On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> This is really weird.
>>>
>>> is the loopback interface up and running on both nodes, and with the same ip?
>>>
>>> can you run the following on both compute nodes?
>>> netstat -nr
>>>
>>> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>>>> Now it tries to go through the loopback address:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 127.0.0.1 failed: Connection refused (111)
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [pmd.pakmet.com:30867] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> --mca btl ^openib
>>>>> disables the openib btl, which is native InfiniBand only.
>>>>>
>>>>> ib0 is treated as any TCP interface and is then handled by the tcp btl
>>>>>
>>>>> another option is for you to use
>>>>> --mca btl_tcp_if_exclude ib0
>>>>>
>>>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>>>> You are right, it is running on the 10.0.0.0 interface:
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> Process 0 decremented value: 8
>>>>>> Process 0 decremented value: 7
>>>>>> Process 0 decremented value: 6
>>>>>> Process 1 exiting
>>>>>> Process 0 decremented value: 5
>>>>>> Process 0 decremented value: 4
>>>>>> Process 0 decremented value: 3
>>>>>> Process 0 decremented value: 2
>>>>>> Process 0 decremented value: 1
>>>>>> Process 0 decremented value: 0
>>>>>> Process 0 exiting
>>>>>> [pmdtest@pmd ~]$
>>>>>>
>>>>>> while the 192.168.108.* ip addresses belong to the ib interface:
>>>>>>
>>>>>> [root@compute-01-01 ~]# ifconfig
>>>>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:1000
>>>>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>>>           Interrupt:169 Memory:dc000000-dc012100
>>>>>>
>>>>>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>>>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:256
>>>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>>>
>>>>>> So the point is: why is mpirun following the ib path when it has been disabled? Possible solutions?
>>>>>>
>>>>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a 10.0.0.8 address
>>>>>>>
>>>>>>> is the 192.168.* network a point to point network (for example between a host and a mic) so two nodes
>>>>>>> cannot ping each other via this address?
>>>>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* ip address of compute-01-06? */
>>>>>>>
>>>>>>> could you also run
>>>>>>>
>>>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>>>
>>>>>>> and see whether it helps?
>>>>>>>
>>>>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>>>>> Same result in both cases:
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>
>>>>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>
>>>>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> it seems you messed up the command line
>>>>>>>>>
>>>>>>>>> could you try
>>>>>>>>>
>>>>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>>>
>>>>>>>>> can you also try to run mpirun from a compute node instead of the head node?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>>>>> Here is what I see when disabling openib support:
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c
>>>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>>>>> to launch so we are aborting.
>>>>>>>>>>
>>>>>>>>>> while the nodes can still ssh to each other:
>>>>>>>>>>
>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali
>>>>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>>>>> Hi Jeff
>>>>>>>>>>>
>>>>>>>>>>> No firewall is enabled. Running the diagnostics, I found that a non-communicating MPI job runs fine,
>>>>>>>>>>> while ring_c remains stuck. There are of course warnings about OpenFabrics, but in my case I am
>>>>>>>>>>> running the application with openib disabled. Please see below:
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this job.
>>>>>>>>>>>
>>>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this job.
>>>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>>>>> Process 0 sent to 1
>>>>>>>>>>> Process 0 decremented value: 9
>>>>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>>>>
>>>>>>>>>>>> See this FAQ item:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear All
>>>>>>>>>>>>>
>>>>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the
>>>>>>>>>>>>> following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>>>>>> compute-01-06, are not able to communicate with each other, while the nodes
>>>>>>>>>>>>> can see each other over ping.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>>>>
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1 ttl=64 time=0.108 ms
>>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2 ttl=64 time=0.088 ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahsan

-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department, H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014
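P.S. Summarizing what has worked so far in this thread: restricting the TCP BTL to the Ethernet network did let ring_c complete, e.g.

  mpirun --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 --host compute-01-01,compute-01-06 ring_c

Following Gilles' suggestion, excluding both lo and ib0 should also work; if I understand the parameter correctly, overriding btl_tcp_if_exclude replaces its default value, so the loopback interface has to be listed again explicitly:

  mpirun --mca btl ^openib --mca btl_tcp_if_exclude lo,ib0 --host compute-01-01,compute-01-06 ring_c

To avoid typing this every time, I believe the same settings can be put into $HOME/.openmpi/mca-params.conf, one parameter per line (untested sketch on my side):

  btl = ^openib
  btl_tcp_if_include = 10.0.0.0/8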