mpirun complains about the 192.168.108.10 IP address, but ping reports a
10.0.0.8 address. By default the TCP BTL tries to connect over every interface
it finds, so an unreachable 192.168.108.* interface would explain the "No route
to host" errors.

Is the 192.168.* network a point-to-point network (for example between a
host and a MIC), so that two nodes cannot ping each other via this address?
/* e.g. from compute-01-01, can you ping the 192.168.108.* IP address of
compute-01-06? */
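
For reference, checking that could be as simple as the following, run on the
compute nodes (this assumes the iproute2 tools are installed there, and that
192.168.108.10 is indeed compute-01-06's address on that network):

ip addr show                   # list the interfaces/addresses the node actually has
ping -c 2 192.168.108.10       # the address from the error message
ip route get 192.168.108.10    # shows which route, if any, would be used to reach it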

Could you also run

mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c

and see whether it helps?
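
For what it is worth, if that works, the setting can be made persistent so that
every mpirun picks it up, for example via the per-user MCA parameter file (the
system-wide openmpi-mca-params.conf works the same way):

# $HOME/.openmpi/mca-params.conf
btl_tcp_if_include = 10.0.0.0/8

Alternatively, the unroutable 192.168.108.* network could be excluded with
btl_tcp_if_exclude instead; the two options should not be combined.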


On 2014/11/13 16:24, Syed Ahsan Ali wrote:
> Same result in both cases
>
> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
> compute-01-01,compute-01-06 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
>
>
> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
> compute-01-01,compute-01-06 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
>
>
> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>> Hi,
>>
>> it seems you messed up the command line
>>
>> could you try
>>
>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>
>>
>> can you also try to run mpirun from a compute node instead of the head
>> node ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>> Here is what I see when disabling openib support:
>>>
>>>
>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>> compute-01-01,compute-01-06 ring_c
>>> ssh:  orted: Temporary failure in name resolution
>>> ssh:  orted: Temporary failure in name resolution
>>> --------------------------------------------------------------------------
>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>> to launch so we are aborting.
>>>
>>> The nodes can still ssh to each other, though:
>>>
>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>> [pmdtest@compute-01-06 ~]$
>>>
>>>
>>>
>>>
>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>> wrote:
>>>> Hi Jeff,
>>>>
>>>> No firewall is enabled. Running the diagnostics, I found that the
>>>> non-communicating MPI job (hello_c.out) runs, while ring_c remains stuck.
>>>> There are of course warnings about OpenFabrics, but in my case I am running
>>>> the application with openib disabled. Please see below:
>>>>
>>>>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>> --------------------------------------------------------------------------
>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>> job.
>>>>   Local host: compute-01-01.private.dns.zone
>>>> --------------------------------------------------------------------------
>>>> Hello, world, I am 0 of 2
>>>> Hello, world, I am 1 of 2
>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>> help-mpi-btl-openib.txt / no active ports found
>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>> --------------------------------------------------------------------------
>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>> job.
>>>>   Local host: compute-01-01.private.dns.zone
>>>> --------------------------------------------------------------------------
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>> help-mpi-btl-openib.txt / no active ports found
>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>> <jsquy...@cisco.com> wrote:
>>>>> Do you have firewalling enabled on either server?
>>>>>
>>>>> See this FAQ item:
>>>>>
>>>>>     
>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>
>>>>>
>>>>>
>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>
>>>>>> Dear All
>>>>>>
>>>>>> I need your advice. While trying to run an mpirun job across nodes I get
>>>>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>> compute-01-06, are not able to communicate with each other, although the
>>>>>> nodes can see each other via ping.
>>>>>>
>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>
>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>> ttl=64 time=0.108 ms
>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>> ttl=64 time=0.088 ms
>>>>>>
>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Ahsan
