I don't see it running:

[pmdtest@compute-01-01 ~]$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
[pmdtest@compute-01-01 ~]$ exit
logout
Connection to compute-01-01 closed.

[pmdtest@pmd ~]$ ssh compute-01-06
Last login: Thu Nov 13 12:06:14 2014 from compute-01-01.private.dns.zone
[pmdtest@compute-01-06 ~]$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ib0
239.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 eth0
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
0.0.0.0         10.0.0.1        0.0.0.0         UG        0 0          0 eth0
[pmdtest@compute-01-06 ~]$
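Note that netstat -nr on Linux usually does not list the loopback route at all (it lives in the kernel's "local" routing table), so its absence above does not by itself mean lo is down. A minimal sketch of checking it directly on each compute node, reusing only the interfaces and addresses already shown in this thread:

    # is the loopback interface up and does it carry 127.0.0.1?
    ifconfig lo            # or: ip addr show lo

    # which interface and source address would be used to reach the other node?
    ip route get 10.0.0.8  # per the routing table above, this should report "dev eth0"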
On Thu, Nov 13, 2014 at 12:56 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> This is really weird.
>
> Is the loopback interface up and running on both nodes, and with the same
> IP?
>
> Can you run the following on both compute nodes?
>   netstat -nr
>
> On 2014/11/13 16:50, Syed Ahsan Ali wrote:
>> Now it goes through the loopback address:
>>
>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
>> btl_tcp_if_exclude ib0 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 127.0.0.1 failed: Connection refused (111)
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> [pmd.pakmet.com:30867] 1 more process has sent help message
>> help-mpi-btl-openib.txt / no active ports found
>> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
>> 0 to see all help / error messages
>>
>> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> --mca btl ^openib
>>> disables the openib BTL, which is native InfiniBand only.
>>>
>>> ib0 is treated like any other TCP interface and is then handled by the tcp BTL.
>>>
>>> Another option is for you to use
>>> --mca btl_tcp_if_exclude ib0
>>>
>>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>>> You are right, it is running on the 10.0.0.0 interface:
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> Process 0 decremented value: 8
>>>> Process 0 decremented value: 7
>>>> Process 0 decremented value: 6
>>>> Process 1 exiting
>>>> Process 0 decremented value: 5
>>>> Process 0 decremented value: 4
>>>> Process 0 decremented value: 3
>>>> Process 0 decremented value: 2
>>>> Process 0 decremented value: 1
>>>> Process 0 decremented value: 0
>>>> Process 0 exiting
>>>> [pmdtest@pmd ~]$
>>>>
>>>> While the 192.168.108.* addresses are for the IB interface:
>>>>
>>>> [root@compute-01-01 ~]# ifconfig
>>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>>           Interrupt:169 Memory:dc000000-dc012100
>>>>
>>>> ib0       Link encap:InfiniBand  HWaddr
>>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:256
>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>
>>>> So the point is: why is mpirun following the IB path when it has
>>>> been disabled? Possible solutions?
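One likely explanation for the connect() to 127.0.0.1 failure above, offered as an assumption rather than a confirmed diagnosis: setting btl_tcp_if_exclude by hand replaces Open MPI's default exclusion list, which normally filters out the loopback interface, so lo gets picked up unless it is excluded explicitly as well. Two hedged variants of the ring_c test, using only the interface names and the 10.0.0.0/8 network shown in the ifconfig output above:

    # exclude both the loopback and the IPoIB interface explicitly
    mpirun --host compute-01-01,compute-01-06 \
           --mca btl_tcp_if_exclude lo,ib0 ring_c

    # or whitelist only the Ethernet network (the variant that already worked above)
    mpirun --host compute-01-01,compute-01-06 \
           --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 ring_c

The current defaults for these parameters can be inspected with "ompi_info --param btl tcp" (on newer Open MPI releases, add --level 9 to see them all).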
>>>>
>>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>> mpirun complains about the 192.168.108.10 IP address, but ping reports a
>>>>> 10.0.0.8 address.
>>>>>
>>>>> Is the 192.168.* network a point-to-point network (for example between a
>>>>> host and a MIC), so that two nodes
>>>>> cannot ping each other via this address?
>>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* IP address of
>>>>> compute-01-06? */
>>>>>
>>>>> Could you also run
>>>>>
>>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>>
>>>>> and see whether it helps?
>>>>>
>>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>>> Same result in both cases:
>>>>>>
>>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>> Process 0 sent to 1
>>>>>> Process 0 decremented value: 9
>>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>
>>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> It seems you messed up the command line.
>>>>>>>
>>>>>>> Could you try
>>>>>>>
>>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>>
>>>>>>> Can you also try to run mpirun from a compute node instead of the head
>>>>>>> node?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>>> Here is what I see when disabling openib support:
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>>> to launch so we are aborting.
>>>>>>>>
>>>>>>>> While the nodes can still ssh to each other:
>>>>>>>>
>>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from
>>>>>>>> compute-01-01.private.dns.zone
>>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>>
>>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali
>>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>>> Hi Jeff,
>>>>>>>>>
>>>>>>>>> No firewall is enabled. Running the diagnostics, I found that the
>>>>>>>>> non-communicating MPI job runs fine, while ring_c remains stuck.
>>>>>>>>> There are of course warnings for OpenFabrics, but in my case I am running the
>>>>>>>>> application with openib disabled. Please see below:
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>>>> job.
>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>
>>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>>>> job.
>>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>>> Process 0 sent to 1
>>>>>>>>> Process 0 decremented value: 9
>>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>>> 0 to see all help / error messages
>>>>>>>>>
>>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>>
>>>>>>>>>> See this FAQ item:
>>>>>>>>>>
>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>>>>>
>>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear All,
>>>>>>>>>>>
>>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the
>>>>>>>>>>> following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>>>> compute-01-06, are not able to communicate with each other, while the nodes
>>>>>>>>>>> can see each other via ping.
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>>
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>>
>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>>>>>> ttl=64 time=0.108 ms
>>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>>>>>> ttl=64 time=0.088 ms
>>>>>>>>>>>
>>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> Ahsan

--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department, H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014
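P.S. Putting the pieces of this thread together, a sketch (untested here) of the original regcm launch from the first post combined with the interface selection that worked for ring_c; hostfile name and binary path are taken verbatim from that post:

    # force the TCP BTL onto the 10.0.0.0/8 Ethernet network and skip openib
    mpirun -np 16 -hostfile hostlist \
        --mca btl ^openib \
        --mca btl_tcp_if_include 10.0.0.0/8 \
        ../bin/regcmMPICLM45 regcm.in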