You are right, it is running on the 10.0.0.0 interface.

[pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca btl_tcp_if_include 10.0.0.0/8 ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 1 exiting
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
[pmdtest@pmd ~]$
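If btl_tcp_if_include turns out to be the fix, it does not have to be repeated on every command line. Here is a minimal sketch of making it persistent; it assumes Open MPI's per-user MCA parameter file (~/.openmpi/mca-params.conf) and that the pmdtest home directory is shared across the nodes, as it appears to be on this cluster:

# Persist the setting in the per-user MCA parameter file,
# which is read by every Open MPI process started by this user
mkdir -p ~/.openmpi
echo "btl_tcp_if_include = 10.0.0.0/8" >> ~/.openmpi/mca-params.conf

# Or set it per-session via the environment; mpirun forwards
# OMPI_MCA_* variables to the processes it launches on remote nodes
export OMPI_MCA_btl_tcp_if_include=10.0.0.0/8
mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c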
The IP addresses 192.168.108.*, on the other hand, belong to the IB interface:

[root@compute-01-01 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
          inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
          inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
          Interrupt:169 Memory:dc000000-dc012100

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.108.14  Bcast:192.168.108.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:65520  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

So the point is: why is mpirun following the IB path while it has been disabled? Possible solutions? (One workaround is sketched after the quoted thread at the end of this message.)

On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> mpirun complains about the 192.168.108.10 ip address, but ping reports a
> 10.0.0.8 address
>
> is the 192.168.* network a point to point network (for example between a
> host and a mic) so two nodes
> cannot ping each other via this address ?
> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
> compute-01-06 ? */
>
> could you also run
>
> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_include 10.0.0.0/8 ring_c
>
> and see whether it helps ?
>
>
> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>> Same result in both cases
>>
>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>> compute-01-01,compute-01-06 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>>
>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>> compute-01-01,compute-01-06 ring_c
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>>
>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>> Hi,
>>>
>>> it seems you messed up the command line
>>>
>>> could you try
>>>
>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>
>>>
>>> can you also try to run mpirun from a compute node instead of the head
>>> node ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>> Here is what I see when disabling openib support:
>>>>
>>>>
>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>> compute-01-01,compute-01-06 ring_c
>>>> ssh: orted: Temporary failure in name resolution
>>>> ssh: orted: Temporary failure in name resolution
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> While the nodes can still ssh to each other:
>>>>
>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>> [pmdtest@compute-01-06 ~]$
>>>>
>>>>
>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali <ahsansha...@gmail.com>
>>>> wrote:
>>>>> Hi Jeff
>>>>>
>>>>> No firewall is enabled. Running the diagnostics, I found that the
>>>>> non-communication MPI job runs, while ring_c remains stuck. There
>>>>> are of course warnings about OpenFabrics, but in my case I am running
>>>>> the application with openib disabled. Please see below:
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>> job.
>>>>> Local host: compute-01-01.private.dns.zone
>>>>> --------------------------------------------------------------------------
>>>>> Hello, world, I am 0 of 2
>>>>> Hello, world, I am 1 of 2
>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>>> 0 to see all help / error messages
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>> no active ports detected (or Open MPI was unable to use them). This
>>>>> is most certainly not what you wanted. Check your cables, subnet
>>>>> manager configuration, etc. The openib BTL will be ignored for this
>>>>> job.
>>>>> Local host: compute-01-01.private.dns.zone
>>>>> --------------------------------------------------------------------------
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>>> 0 to see all help / error messages
>>>>>
>>>>>
>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>> <jsquy...@cisco.com> wrote:
>>>>>> Do you have firewalling enabled on either server?
>>>>>>
>>>>>> See this FAQ item:
>>>>>>
>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>
>>>>>>
>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear All
>>>>>>>
>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get
>>>>>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>> compute-01-06, are not able to communicate with each other, even though
>>>>>>> the nodes can see each other on ping.
>>>>>>>
>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>
>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>
>>>>>>> mpirun: killing job...
>>>>>>>
>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>> ttl=64 time=0.108 ms
>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>> ttl=64 time=0.088 ms
>>>>>>>
>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Ahsan
>>>>>>> _______________________________________________
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/11/25788.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25789.php
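As for why mpirun still follows the IB path when openib has been disabled: --mca btl ^openib only disables the native InfiniBand BTL. The TCP BTL stays in play, and by default it will try every configured interface it finds, including the IPoIB interface ib0 with its 192.168.108.* address; those addresses apparently are not reachable between the nodes, hence the "No route to host" errors. A sketch of the two usual workarounds, assuming (as in the ifconfig output above) that the Ethernet network is 10.0.0.0/8 and the IPoIB interface is named ib0 on every node:

# Option 1: tell the TCP BTL to use only the Ethernet subnet
# (this is the command that worked at the top of this message)
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.0.0.0/8 \
       --host compute-01-01,compute-01-06 ring_c

# Option 2: exclude the IPoIB interface instead; overriding
# btl_tcp_if_exclude replaces its default value, so loopback has
# to be listed explicitly as well
mpirun --mca btl ^openib --mca btl_tcp_if_exclude lo,ib0 \
       --host compute-01-01,compute-01-06 ring_c

Either way, once the TCP BTL is restricted to addresses the nodes can actually route to each other, the "No route to host" failures should go away.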