I don't see the loopback interface in the routing tables on either node:

[pmdtest@compute-01-01 ~]$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
[pmdtest@compute-01-01 ~]$ exit
logout
Connection to compute-01-01 closed.
[pmdtest@pmd ~]$ ssh compute-01-06
Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
[pmdtest@compute-01-06 ~]$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
[pmdtest@compute-01-06 ~]$
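
For completeness, here is a quick way to check the loopback interface itself on each node (a minimal sketch using standard net-tools/iproute2 commands; run as root if the interface needs to be brought up):

    # show the loopback interface state and address
    ifconfig lo
    ip addr show lo

    # if lo is down, it can normally be brought back up with
    ifconfig lo 127.0.0.1 netmask 255.0.0.0 up
    # or, with iproute2:
    ip link set lo up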

On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> This is really weird.
>
> Is the loopback interface up and running on both nodes, and with the same
> IP address?
>
> Can you run the following on both compute nodes?
> netstat -nr
>
>
> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>> Now it tries to connect through the loopback address:
>>
>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
>> btl_tcp_if_exclude ib0 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 127.0.0.1 failed: Connection refused (111)
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [pmd.pakmet.com:30867] 1 more process has sent help message
>> help-mpi-btl-openib.txt / no active ports found
>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
>> 0 to see all help / error messages
>>
>>
>>
>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> --mca btl ^openib
>>> disables the openib btl, which is native InfiniBand only.
>>>
>>> ib0 is treated like any other TCP interface and is then handled by the tcp btl.
>>>
>>> Another option is for you to use
>>> --mca btl_tcp_if_exclude ib0
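
For reference, a minimal sketch of the two approaches mentioned above, using the hosts from this thread. One caveat: setting btl_tcp_if_exclude overrides Open MPI's default exclude list, which normally contains the loopback interface, so lo usually needs to be excluded explicitly as well, otherwise the tcp btl may try to connect over 127.0.0.1:

    # option 1: exclude the IPoIB interface (and keep loopback excluded)
    mpirun --mca btl ^openib --host compute-01-01,compute-01-06 \
           --mca btl_tcp_if_exclude lo,ib0 ring_c

    # option 2: restrict the tcp btl to the Ethernet subnet
    mpirun --mca btl ^openib --host compute-01-01,compute-01-06 \
           --mca btl_tcp_if_include 10.0.0.0/8 ring_c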
>>>
>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>> You are right, it runs on the 10.0.0.0 interface:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06
>>>> --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> Process 0 decremented value: 8
>>>> Process 0 decremented value: 7
>>>> Process 0 decremented value: 6
>>>> Process 1 exiting
>>>> Process 0 decremented value: 5
>>>> Process 0 decremented value: 4
>>>> Process 0 decremented value: 3
>>>> Process 0 decremented value: 2
>>>> Process 0 decremented value: 1
>>>> Process 0 decremented value: 0
>>>> Process 0 exiting
>>>> [pmdtest@pmd ~]$
>>>>
>>>> The IP addresses 192.168.108.* belong to the ib interface:
>>>>
>>>>  [root@compute-01-01 ~]# ifconfig
>>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>           Interrupt:169 Memory:dc000000-dc012100
>>>> ib0       Link encap:InfiniBand  HWaddr
>>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:256
>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>
>>>>
>>>>
>>>> So the question is why mpirun is still following the ib path when it has
>>>> been disabled. Possible solutions?
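
Two commands that may help answer that question (a sketch, assuming ompi_info from the same Open MPI installation is in the path): the first lists the tcp btl's interface-selection parameters and their defaults, the second makes the btl's choices visible at run time.

    # list the tcp btl parameters, including btl_tcp_if_include/exclude
    ompi_info --param btl tcp

    # re-run the ring test with btl verbosity to see which interfaces and
    # peer addresses the tcp btl actually tries
    mpirun --mca btl ^openib --mca btl_base_verbose 30 \
           --host compute-01-01,compute-01-06 ring_c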
>>>>
>>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>>>>> 10.0.0.8 address
>>>>>
>>>>> Is the 192.168.* network a point-to-point network (for example between a
>>>>> host and a MIC), so that two nodes cannot ping each other via this address?
>>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* ip address of
>>>>> compute-01-06? */
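
A sketch of that check, using the IPoIB addresses that appear elsewhere in this thread (192.168.108.14 on compute-01-01, 192.168.108.10 apparently on compute-01-06):

    # from compute-01-01, ping compute-01-06's ib0 address
    ping -c 2 192.168.108.10

    # from compute-01-06, ping compute-01-01's ib0 address
    ping -c 2 192.168.108.14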
>>>>>
>>>>> could you also run
>>>>>
>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>
>>>>> and see whether it helps?
>>>>>
>>>>>
>>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>>> Same result in both cases
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>>
>>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> it seems you messed up the command line
>>>>>>>
>>>>>>> could you try
>>>>>>>
>>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>
>>>>>>>
>>>>>>> can you also try to run mpirun from a compute node instead of the head
>>>>>>> node?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>>> Here is what I see when disabling openib support:
>>>>>>>>
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>>> to launch so we are aborting.
>>>>>>>>
>>>>>>>> While the nodes can still ssh to each other:
>>>>>>>>
>>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali 
>>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>>> Hi Jeff,
>>>>>>>>>
>>>>>>>>> No firewall is enabled. Running the diagnostics, I found that the
>>>>>>>>> non-communicating MPI job runs fine, while ring_c remains stuck. There
>>>>>>>>> are of course warnings about OpenFabrics, but in my case I am running
>>>>>>>>> the application with openib disabled. Please see below.
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>>> job.
>>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>>> job.
>>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>>> Process 0 sent to 1
>>>>>>>>> Process 0 decremented value: 9
>>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>>
>>>>>>>>>> See this FAQ item:
>>>>>>>>>>
>>>>>>>>>>     http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
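
Following that FAQ item, a quick way to check for a host firewall on RHEL/CentOS-style nodes such as these (a sketch; package and service names may differ by distribution):

    # on each compute node: is iptables active, and which rules are loaded?
    service iptables status
    iptables -L -n

    # simple TCP reachability test on the private network, assuming netcat
    # is installed: listen on compute-01-06 ...
    nc -l 5000
    # ... then connect from compute-01-01
    nc compute-01-06 5000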
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear All
>>>>>>>>>>>
>>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get
>>>>>>>>>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>>>> compute-01-06, are not able to communicate with each other, although the
>>>>>>>>>>> nodes can see each other via ping.
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>>
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>
>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>>>>>> ttl=64 time=0.108 ms
>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>>>>>> ttl=64 time=0.088 ms
>>>>>>>>>>>
>>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> Ahsan



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
