Now it tries to connect through the loopback address:

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 127.0.0.1 failed: Connection refused (111)
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[pmd.pakmet.com:30867] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
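A likely explanation for the 127.0.0.1 attempt, assuming the usual Open MPI TCP BTL behaviour: setting btl_tcp_if_exclude replaces the built-in exclude list, which normally contains the loopback interface, so excluding only ib0 makes lo eligible again. Two possible workarounds, sketched here and not tested on this cluster; the second is the include form that already worked earlier in this thread and limits the TCP BTL to the Ethernet network:

[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude lo,ib0 ring_c
[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c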
On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> --mca btl ^openib
> disables the openib btl, which is native infiniband only.
>
> ib0 is treated as any TCP interface and is then handled by the tcp btl
>
> another option is for you to use
> --mca btl_tcp_if_exclude ib0
>
> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>> You are right, it is running on the 10.0.0.0 interface:
>>
>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 1 exiting
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> [pmdtest@pmd ~]$
>>
>> The 192.168.108.* addresses are for the ib interface:
>>
>> [root@compute-01-01 ~]# ifconfig
>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>           Interrupt:169 Memory:dc000000-dc012100
>>
>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:256
>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> So the point is: why is mpirun following the ib path when it has been disabled? Possible solutions?
>>
>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>>> 10.0.0.8 address
>>>
>>> is the 192.168.* network a point to point network (for example between a
>>> host and a mic) so two nodes cannot ping each other via this address ?
>>> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
>>> compute-01-06 ? */
>>>
>>> could you also run
>>>
>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>
>>> and see whether it helps ?
>>>
>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>> Same result in both cases:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>
>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>
>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> Hi,
>>>>>
>>>>> it seems you messed up the command line
>>>>>
>>>>> could you try
>>>>>
>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>
>>>>> can you also try to run mpirun from a compute node instead of the head node ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>> Here is what I see when disabling openib support.
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c
>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>> to launch so we are aborting.
>>>>>>
>>>>>> While the nodes can still ssh to each other:
>>>>>>
>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>
>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>> Hi Jeff
>>>>>>>
>>>>>>> No firewall is enabled. Running the diagnostics I found that the
>>>>>>> non-communicating MPI job runs, while ring_c remains stuck. There are
>>>>>>> of course warnings about OpenFabrics, but in my case I am running the
>>>>>>> application with openib disabled. Please see below:
>>>>>>>
>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>> --------------------------------------------------------------------------
>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>> job.
>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Hello, world, I am 0 of 2
>>>>>>> Hello, world, I am 1 of 2
>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>
>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>> --------------------------------------------------------------------------
>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>> job.
>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>> Process 0 sent to 1
>>>>>>> Process 0 decremented value: 9
>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>
>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>
>>>>>>>> See this FAQ item:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>>
>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear All
>>>>>>>>>
>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get
>>>>>>>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>> compute-01-06, are not able to communicate with each other, while the
>>>>>>>>> nodes can see each other on ping.
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>
>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>
>>>>>>>>> mpirun: killing job...
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1 ttl=64 time=0.108 ms
>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2 ttl=64 time=0.088 ms
>>>>>>>>>
>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>
>>>>>>>>> Thanks in advance.
>>>>>>>>>
>>>>>>>>> Ahsan
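For completeness, two quick checks that follow from the discussion above. First, ping the peer node's ib0 address directly, as Gilles suggested; 192.168.108.10 is inferred from the error messages to be compute-01-06's ib0 address, so substitute the real one if that inference is wrong. Second, rerun ring_c with BTL verbosity raised to see which interfaces the TCP BTL actually selects; btl_base_verbose is a standard Open MPI debugging parameter, though the exact output format varies by version:

[pmdtest@compute-01-01 ~]$ ping -c 2 192.168.108.10
[pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 --mca btl_base_verbose 30 ring_c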