netstat doesn't show a loopback route even on the head node, while ifconfig shows the loopback interface up and running on the compute nodes as well as on the master node.
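(One note, in case it is relevant: on Linux the loopback route normally lives in the kernel's "local" routing table rather than in the main table, so its absence from netstat -nr output does not by itself mean lo is down. Assuming iproute2 is installed, the loopback entries should be visible with:

[root@pmd ~]# ip route show table local

The main-table routes on the head node and the lo interface on a compute node are pasted below.)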
[root@pmd ~]# netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.3.0     0.0.0.0         255.255.255.0   U         0 0          0 eth1
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         192.168.3.1     0.0.0.0         UG        0 0          0 eth1

[root@compute-01-01 ~]# ifconfig
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:880 errors:0 dropped:0 overruns:0 frame:0
          TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:150329 (146.8 KiB)  TX bytes:150329 (146.8 KiB)

On Thu, Nov 13, 2014 at 1:02 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> but it is running on your head node, isn't it?
>
> you might want to double check why there is no loopback interface on
> your compute nodes.
> in the meantime, you can disable the lo and ib0 interfaces
>
> Cheers,
>
> Gilles
>
> On 2014/11/13 16:59, Syed Ahsan Ali wrote:
>> I don't see it running:
>>
>> [pmdtest@compute-01-01 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
>> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
>> [pmdtest@compute-01-01 ~]$ exit
>> logout
>> Connection to compute-01-01 closed.
>>
>> [pmdtest@pmd ~]$ ssh compute-01-06
>> Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
>> [pmdtest@compute-01-06 ~]$ netstat -nr
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
>> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
>> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
>> [pmdtest@compute-01-06 ~]$
>>
>> On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> This is really weird.
>>>
>>> is the loopback interface up and running on both nodes, and with the same ip?
>>>
>>> can you run the following on both compute nodes?
>>> netstat -nr
>>>
>>> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>>>> Now it tries to go through the loopback address:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 127.0.0.1 failed: Connection refused (111)
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> [pmd.pakmet.com:30867] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> --mca btl ^openib
>>>>> disables the openib btl, which is native InfiniBand only.
>>>>>
>>>>> ib0 is treated as any TCP interface and is then handled by the tcp btl
>>>>>
>>>>> another option is for you to use
>>>>> --mca btl_tcp_if_exclude ib0
>>>>>
>>>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>>>> You are right, it is running on the 10.0.0.0 interface:
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> Process 0 decremented value: 8
>>>>>> Process 0 decremented value: 7
>>>>>> Process 0 decremented value: 6
>>>>>> Process 1 exiting
>>>>>> Process 0 decremented value: 5
>>>>>> Process 0 decremented value: 4
>>>>>> Process 0 decremented value: 3
>>>>>> Process 0 decremented value: 2
>>>>>> Process 0 decremented value: 1
>>>>>> Process 0 decremented value: 0
>>>>>> Process 0 exiting
>>>>>> [pmdtest@pmd ~]$
>>>>>>
>>>>>> while the 192.168.108.* ip addresses belong to the ib interface:
>>>>>>
>>>>>> [root@compute-01-01 ~]# ifconfig
>>>>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:1000
>>>>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>>>           Interrupt:169 Memory:dc000000-dc012100
>>>>>>
>>>>>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>>>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:256
>>>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>>>
>>>>>> So the point is: why is mpirun following the ib path when it has been disabled? Possible solutions?
>>>>>>
>>>>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a 10.0.0.8 address
>>>>>>>
>>>>>>> is the 192.168.* network a point to point network (for example between a host and a mic) so two nodes
>>>>>>> cannot ping each other via this address?
>>>>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* ip address of compute-01-06? */
>>>>>>>
>>>>>>> could you also run
>>>>>>>
>>>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>>>
>>>>>>> and see whether it helps?
>>>>>>>
>>>>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>>>>> Same result in both cases:
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>
>>>>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>
>>>>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> it seems you messed up the command line
>>>>>>>>>
>>>>>>>>> could you try
>>>>>>>>>
>>>>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>>>
>>>>>>>>> can you also try to run mpirun from a compute node instead of the head node?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>>>>> Here is what I see when disabling openib support:
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c
>>>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>>>>> to launch so we are aborting.
>>>>>>>>>>
>>>>>>>>>> while the nodes can still ssh to each other:
>>>>>>>>>>
>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali
>>>>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>>>>> Hi Jeff
>>>>>>>>>>>
>>>>>>>>>>> No firewall is enabled. Running the diagnostics, I found that a non-communicating MPI job runs fine,
>>>>>>>>>>> while ring_c remains stuck. There are of course warnings about OpenFabrics, but in my case I am
>>>>>>>>>>> running the application with openib disabled. Please see below:
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this job.
>>>>>>>>>>>
>>>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this job.
>>>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>>>>> Process 0 sent to 1
>>>>>>>>>>> Process 0 decremented value: 9
>>>>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>>>>
>>>>>>>>>>>> See this FAQ item:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear All
>>>>>>>>>>>>>
>>>>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the
>>>>>>>>>>>>> following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>>>>>> compute-01-06, are not able to communicate with each other, while the nodes
>>>>>>>>>>>>> can see each other over ping.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>>>>
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1 ttl=64 time=0.108 ms
>>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2 ttl=64 time=0.088 ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahsan

-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department, H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014
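P.S. Summarizing what has worked so far in this thread: restricting the TCP BTL to the Ethernet network did let ring_c complete, e.g.

  mpirun --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 --host compute-01-01,compute-01-06 ring_c

Following Gilles' suggestion, excluding both lo and ib0 should also work; if I understand the parameter correctly, overriding btl_tcp_if_exclude replaces its default value, so the loopback interface has to be listed again explicitly:

  mpirun --mca btl ^openib --mca btl_tcp_if_exclude lo,ib0 --host compute-01-01,compute-01-06 ring_c

To avoid typing this every time, I believe the same settings can be put into $HOME/.openmpi/mca-params.conf, one parameter per line (untested sketch on my side):

  btl = ^openib
  btl_tcp_if_include = 10.0.0.0/8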