mpirun complains about the 192.168.108.10 IP address, but ping reports a 10.0.0.8 address.

Is the 192.168.* network a point-to-point network (for example between a host and a MIC), so that two nodes cannot ping each other via this address? (e.g. from compute-01-01, can you ping the 192.168.108.* IP address of compute-01-06?)

Could you also run

mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c

and see whether it helps?
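For the ping test, something along these lines run on compute-01-01 should tell us whether the 192.168.108.* network is usable between the nodes at all (the 192.168.108.10 address is taken from the error output below; interface names will differ on your system):

    # which interfaces carry the 10.0.0.* and 192.168.108.* addresses ?
    ip addr show

    # which route, if any, would be used to reach the other node's 192.168.108.* address ?
    ip route get 192.168.108.10

    # can the address from the error message be reached directly ?
    ping -c 2 192.168.108.10

    # force the TCP BTL onto the routable 10.0.0.0/8 network
    mpirun --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 --host compute-01-01,compute-01-06 ring_c

If the 192.168.108.* addresses really are on point-to-point links, btl_tcp_if_exclude can be used the other way around to keep the TCP BTL off those interfaces (keep lo in the exclude list as well).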
On 2014/11/13 16:24, Syed Ahsan Ali wrote:
> Same result in both cases
>
> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>
> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>
> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> Hi,
>>
>> it seems you messed up the command line
>>
>> could you try
>>
>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>
>> can you also try to run mpirun from a compute node instead of the head node ?
>>
>> Cheers,
>>
>> Gilles
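To spell out that command-line fix: in the failing attempt quoted further below, --host is immediately followed by --mca rather than by the host list, so mpirun never receives compute-01-01,compute-01-06 as hosts. Side by side, with the same hosts and ring_c binary as in the thread:

    # broken ordering: "--mca" ends up as the argument of --host
    mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c

    # corrected ordering: the host list directly follows --host
    mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c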
>>>> Local host: compute-01-01.private.dns.zone >>>> -------------------------------------------------------------------------- >>>> Hello, world, I am 0 of 2 >>>> Hello, world, I am 1 of 2 >>>> [pmd.pakmet.com:06386] 1 more process has sent help message >>>> help-mpi-btl-openib.txt / no active ports found >>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to >>>> 0 to see all help / error messages >>>> >>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c >>>> -------------------------------------------------------------------------- >>>> WARNING: There is at least one OpenFabrics device found but there are >>>> no active ports detected (or Open MPI was unable to use them). This >>>> is most certainly not what you wanted. Check your cables, subnet >>>> manager configuration, etc. The openib BTL will be ignored for this >>>> job. >>>> Local host: compute-01-01.private.dns.zone >>>> -------------------------------------------------------------------------- >>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring) >>>> Process 0 sent to 1 >>>> Process 0 decremented value: 9 >>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>> connect() to 192.168.108.10 failed: No route to host (113) >>>> [pmd.pakmet.com:15965] 1 more process has sent help message >>>> help-mpi-btl-openib.txt / no active ports found >>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to >>>> 0 to see all help / error messages >>>> <span class="sewh9wyhn1gq30p"><br></span> >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres) >>>> <jsquy...@cisco.com> wrote: >>>>> Do you have firewalling enabled on either server? >>>>> >>>>> See this FAQ item: >>>>> >>>>> >>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems >>>>> >>>>> >>>>> >>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote: >>>>> >>>>>> Dear All >>>>>> >>>>>> I need your advice. While trying to run mpirun job across nodes I get >>>>>> following error. It seems that the two nodes i.e, compute-01-01 and >>>>>> compute-01-06 are not able to communicate with each other. While nodes >>>>>> see each other on ping. >>>>>> >>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl >>>>>> ^openib ../bin/regcmMPICLM45 regcm.in >>>>>> >>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> connect() to 192.168.108.14 failed: No route to host (113) >>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> connect() to 192.168.108.14 failed: No route to host (113) >>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> connect() to 192.168.108.14 failed: No route to host (113) >>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> connect() to 192.168.108.10 failed: No route to host (113) >>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] >>>>>> connect() to 192.168.108.10 failed: No route to host (113) >>>>>> connect() to 192.168.108.10 failed: No route to host (113) >>>>>> >>>>>> mpirun: killing job... 
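For what it is worth, connect() failing with "No route to host (113)" can come either from a genuinely missing route to the 192.168.108.* network or from a firewall rule that rejects the connection, so it may be worth double-checking both directly on the compute nodes. A quick sketch, assuming an iptables-based setup (adjust for your distribution):

    # any REJECT/DROP rules loaded on the compute nodes ?
    /sbin/iptables -L -n

    # is there a route towards the other node's 192.168.108.* address at all ?
    ip route get 192.168.108.10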
>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>
>>>>>> Dear All
>>>>>>
>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the
>>>>>> following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>> compute-01-06, are not able to communicate with each other, while the nodes
>>>>>> can see each other via ping.
>>>>>>
>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>
>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1 ttl=64 time=0.108 ms
>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2 ttl=64 time=0.088 ms
>>>>>>
>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Ahsan
>>>>>> _______________________________________________
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/11/25788.php