But it is running on your head node, isn't it?

You might want to double-check why there is no loopback interface on
your compute nodes.
In the meantime, you can exclude the lo and ib0 interfaces.
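As a sketch (using the host names and interface names from this thread; adjust for your cluster), you could first verify the loopback interface on each compute node, then exclude both lo and ib0 from the tcp BTL:

```shell
# On each compute node: check whether the loopback interface exists and is UP
/sbin/ifconfig lo

# Workaround: exclude both the loopback and the IPoIB interface from the tcp BTL
mpirun --host compute-01-01,compute-01-06 \
       --mca btl_tcp_if_exclude lo,ib0 ring_c
```

Note that setting btl_tcp_if_exclude replaces Open MPI's default exclude list (which already contains lo), which is likely why excluding only ib0 earlier in this thread resulted in the failed connect() to 127.0.0.1.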

Cheers,

Gilles

On 2014/11/13 16:59, Syed Ahsan Ali wrote:
>  I don't see it running
>
> [pmdtest@compute-01-01 ~]$ netstat -nr
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
> [pmdtest@compute-01-01 ~]$ exit
> logout
> Connection to compute-01-01 closed.
> [pmdtest@pmd ~]$ ssh compute-01-06
> Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
> [pmdtest@compute-01-06 ~]$ netstat -nr
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> 192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
> 239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
> 10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
> 0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
> [pmdtest@compute-01-06 ~]$
>
> On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>> This is really weird.
>>
>> is the loopback interface up and running on both nodes, and with the same
>> IP?
>>
>> Can you run the following on both compute nodes?
>> netstat -nr
>>
>>
>> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>>> Now it looks through the loopback address
>>>
>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
>>> btl_tcp_if_exclude ib0 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 127.0.0.1 failed: Connection refused (111)
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> [pmd.pakmet.com:30867] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / no active ports found
>>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>>
>>>
>>>
>>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> --mca btl ^openib
>>>> disables the openib BTL, which is native InfiniBand only.
>>>>
>>>> ib0 is treated like any other TCP interface and is then handled by the
>>>> tcp BTL.
>>>>
>>>> Another option is to use
>>>> --mca btl_tcp_if_exclude ib0
>>>>
>>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>>> You are right it is running on 10.0.0.0 interface [pmdtest@pmd ~]$
>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> Process 0 decremented value: 8
>>>>> Process 0 decremented value: 7
>>>>> Process 0 decremented value: 6
>>>>> Process 1 exiting
>>>>> Process 0 decremented value: 5
>>>>> Process 0 decremented value: 4
>>>>> Process 0 decremented value: 3
>>>>> Process 0 decremented value: 2
>>>>> Process 0 decremented value: 1
>>>>> Process 0 decremented value: 0
>>>>> Process 0 exiting
>>>>> [pmdtest@pmd ~]$
>>>>>
>>>>> The 192.168.108.* IP addresses, meanwhile, belong to the IB interface.
>>>>>
>>>>>  [root@compute-01-01 ~]# ifconfig
>>>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>>           collisions:0 txqueuelen:1000
>>>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>>           Interrupt:169 Memory:dc000000-dc012100
>>>>> ib0       Link encap:InfiniBand  HWaddr
>>>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  
>>>>> Mask:255.255.255.0
>>>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>>           collisions:0 txqueuelen:256
>>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>>
>>>>>
>>>>>
>>>>> So the point is: why is mpirun still following the IB path when it has
>>>>> been disabled? Possible solutions?
>>>>>
>>>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>> mpirun complains about the 192.168.108.10 IP address, but ping reports a
>>>>>> 10.0.0.8 address.
>>>>>>
>>>>>> Is the 192.168.* network a point-to-point network (for example between a
>>>>>> host and a MIC), so that two nodes cannot ping each other via this
>>>>>> address?
>>>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* IP address of
>>>>>> compute-01-06 ? */
>>>>>>
>>>>>> could you also run
>>>>>>
>>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>>
>>>>>> and see whether it helps ?
>>>>>>
>>>>>>
>>>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>>>> Same result in both cases
>>>>>>>
>>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>> Process 0 sent to 1
>>>>>>> Process 0 decremented value: 9
>>>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>
>>>>>>>
>>>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>> Process 0 sent to 1
>>>>>>> Process 0 decremented value: 9
>>>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> it seems the command line got mixed up (--host was separated from its
>>>>>>>> host list)
>>>>>>>>
>>>>>>>> could you try
>>>>>>>>
>>>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>>
>>>>>>>>
>>>>>>>> can you also try to run mpirun from a compute node instead of the head
>>>>>>>> node ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>>>> Here is what I see when disabling openib support:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>>>> to launch so we are aborting.
>>>>>>>>>
>>>>>>>>> While the nodes can still ssh to each other:
>>>>>>>>>
>>>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from 
>>>>>>>>> compute-01-01.private.dns.zone
>>>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali 
>>>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>>>>  Hi Jeff
>>>>>>>>>>
>>>>>>>>>> No firewall is enabled. Running the diagnostics, I found that a
>>>>>>>>>> non-communicating MPI job runs fine, while ring_c remains stuck. There
>>>>>>>>>> are of course warnings about OpenFabrics, but in my case I am running
>>>>>>>>>> the application with openib disabled. Please see below.
>>>>>>>>>>
>>>>>>>>>>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 
>>>>>>>>>> hello_c.out
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>>>> job.
>>>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" 
>>>>>>>>>> to
>>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>>>> job.
>>>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>>>> Process 0 sent to 1
>>>>>>>>>> Process 0 decremented value: 9
>>>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" 
>>>>>>>>>> to
>>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>>>
>>>>>>>>>>> See this FAQ item:
>>>>>>>>>>>
>>>>>>>>>>>     
>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear All
>>>>>>>>>>>>
>>>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes
>>>>>>>>>>>> I get the following error. It seems that the two nodes, i.e.
>>>>>>>>>>>> compute-01-01 and compute-01-06, are not able to communicate with
>>>>>>>>>>>> each other, even though they can see each other via ping.
>>>>>>>>>>>>
>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>>>
>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>>
>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of 
>>>>>>>>>>>> data.
>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>>>>>>> ttl=64 time=0.108 ms
>>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>>>>>>> ttl=64 time=0.088 ms
>>>>>>>>>>>>
>>>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>
>>>>>>>>>>>> Ahsan
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/11/25792.php
>
>
