This is really weird. Is the loopback interface up and running on both nodes, and with the same IP?

Can you run netstat -nr on both compute nodes?
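For reference, what to look for would be something like the sketch below (the routes and interface names are only inferred from the ifconfig output quoted further down, so treat it as an illustration rather than expected output): no route should send 10.0.0.0/8 or 192.168.108.0/24 traffic through lo, and lo itself should carry only 127.0.0.1.

$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0
192.168.108.0   0.0.0.0         255.255.255.0   U         0 0          0 ib0

$ ifconfig lo
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING ...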
On 2014/11/13 16:50, Syed Ahsan Ali wrote:
> Now it goes through the loopback address:
>
> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 127.0.0.1 failed: Connection refused (111)
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [pmd.pakmet.com:30867] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> --mca btl ^openib
>> disables the openib btl, which is native InfiniBand only.
>>
>> ib0 is treated like any TCP interface and is then handled by the tcp btl.
>>
>> Another option is for you to use
>> --mca btl_tcp_if_exclude ib0
>>
>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>> You are right, it is running on the 10.0.0.0 interface:
>>>
>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> Process 0 decremented value: 8
>>> Process 0 decremented value: 7
>>> Process 0 decremented value: 6
>>> Process 1 exiting
>>> Process 0 decremented value: 5
>>> Process 0 decremented value: 4
>>> Process 0 decremented value: 3
>>> Process 0 decremented value: 2
>>> Process 0 decremented value: 1
>>> Process 0 decremented value: 0
>>> Process 0 exiting
>>> [pmdtest@pmd ~]$
>>>
>>> The 192.168.108.* addresses are for the ib interface:
>>>
>>> [root@compute-01-01 ~]# ifconfig
>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>           Interrupt:169 Memory:dc000000-dc012100
>>>
>>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:256
>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>
>>> So the point is: why is mpirun following the ib path when it has been disabled? Possible solutions?
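A note on the two options quoted above (based on Open MPI's general MCA parameter handling, not on anything verified on this cluster): btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive, and the default value of btl_tcp_if_exclude already contains the loopback interface. Overriding it with only ib0 therefore makes lo eligible again, which would be one explanation for the connect() to 127.0.0.1 failure above. Keeping lo in the exclude list, or pinning the include list to the Ethernet network, would look roughly like:

$ mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude lo,ib0 ring_c

$ cat $HOME/.openmpi/mca-params.conf
btl = ^openib
btl_tcp_if_include = 10.0.0.0/8

The mca-params.conf file (or the equivalent OMPI_MCA_btl_tcp_if_include environment variable) only saves retyping the options on every mpirun; it does not change the behaviour otherwise.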
>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a 10.0.0.8 address.
>>>>
>>>> Is the 192.168.* network a point-to-point network (for example between a host and a MIC), so that two nodes cannot ping each other via this address?
>>>> /* e.g. from compute-01-01, can you ping the 192.168.108.* ip address of compute-01-06 ? */
>>>>
>>>> Could you also run
>>>>
>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>
>>>> and see whether it helps?
>>>>
>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>> Same result in both cases:
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>
>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>
>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> It seems you messed up the command line. Could you try
>>>>>>
>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>
>>>>>> Can you also try to run mpirun from a compute node instead of the head node?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>> Here is what I see when disabling openib support:
>>>>>>>
>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib compute-01-01,compute-01-06 ring_c
>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>> ssh: orted: Temporary failure in name resolution
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>> to launch so we are aborting.
>>>>>>>
>>>>>>> While the nodes can still ssh to each other:
>>>>>>>
>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>>> [pmdtest@compute-01-06 ~]$
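To answer the ping question quoted above directly: something like the following, run on each node against the other node's ib0 address (192.168.108.10 is assumed to be compute-01-06's ib0, as the connect() errors suggest; 192.168.108.14 is compute-01-01's per the ifconfig output), would show whether the 192.168.108.* network is reachable at all. Given that ib0 reports 0 bytes received and transmitted, a timeout or "No route to host" here would be consistent with the IB interface simply not being usable for IP traffic.

[pmdtest@compute-01-01 ~]$ ping -c 2 192.168.108.10
[pmdtest@compute-01-01 ~]$ ip route get 192.168.108.10

[pmdtest@compute-01-06 ~]$ ping -c 2 192.168.108.14
[pmdtest@compute-01-06 ~]$ ip route get 192.168.108.14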
>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> No firewall is enabled. Running the diagnostics, I found that a non-communicating MPI job runs, while ring_c remains stuck. There are of course warnings for OpenFabrics, but in my case I am running the application with openib disabled. Please see below.
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>>> job.
>>>>>>>>
>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>>>>> job.
>>>>>>>>
>>>>>>>> Local host: compute-01-01.private.dns.zone
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>
>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>
>>>>>>>>> See this FAQ item:
>>>>>>>>>
>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
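On the firewall question above: "No route to host" is errno 113 (EHOSTUNREACH), and it can come either from the kernel having no usable route to the 192.168.108.* network or from a firewall rejecting the connection with an ICMP host-prohibited error, which is why the question keeps coming up. Verifying the "no firewall is enabled" answer directly on both nodes costs nothing; empty chains with an ACCEPT policy would settle it:

[root@compute-01-01 ~]# iptables -L -n
[root@compute-01-06 ~]# iptables -L -n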
>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get the following error. It seems that the two nodes, i.e. compute-01-01 and compute-01-06, are not able to communicate with each other, while the nodes can see each other on ping.
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>
>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1 ttl=64 time=0.108 ms
>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2 ttl=64 time=0.088 ms
>>>>>>>>>>
>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> Ahsan
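Tying the thread together for the original application run: a sketch of a restricted invocation, using only the options already discussed above, would be the line below. It keeps MPI traffic off ib0 and on the 10.0.0.0/8 Ethernet network; it is the workaround discussed in this thread, not an explanation of why the 192.168.108.* addresses are being advertised in the first place.

[pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 ../bin/regcmMPICLM45 regcm.in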