Is your IB card in compute-01-10.private.dns.zone working? Did you check it with ibstat?
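For example, on the node itself (a quick sketch, assuming the usual infiniband-diags/libibverbs tools are installed):

    # The port you use should report "State: Active" and "Physical state: LinkUp"
    ibstat

    # Second opinion from libibverbs; look for "state: PORT_ACTIVE"
    ibv_devinfo

If the port is stuck in Initializing or Polling instead, check the cable and make sure a subnet manager (e.g. opensm) is actually running on the fabric.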
Do you have a dual-port IB card in compute-01-15.private.dns.zone? Did you connect both ports to the same switch on the same subnet?

TCP "no route to host": if it is not a firewall problem, could it be a bad Ethernet port on a node, perhaps? Also, if you use host names in your hostfile, they need to be resolvable to IP addresses. Check that your /etc/hosts file, DNS server, or whatever you use for name resolution, is correct and consistent across the cluster.
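A few quick checks, assuming a typical RHEL/CentOS-style cluster (adjust the commands to your distribution):

    # Does every node resolve the name the same way?
    getent hosts compute-01-10.private.dns.zone

    # From a failing node: can you reach that address at all?
    ping -c 1 192.168.108.10

    # Is iptables in the way? List the rules, and try disabling it for a test.
    iptables -L -n
    service iptables stop

To take the openib BTL out of the picture entirely while you debug, you can also force TCP explicitly, e.g.:

    mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 \
        -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in

(btl_tcp_if_include is optional; substitute the Ethernet interface your cluster actually uses, so OMPI does not try to open sockets over a misconfigured interface.)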
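On the subnet ID question from your earlier mail: /var/cache/opensm/opensm.opts is typically only created when OpenSM caches its options, so its absence is normal. Depending on your OpenSM version you should be able to dump an editable config yourself, roughly like this (check opensm --help for the exact option names on your install):

    # Dump the default configuration to a file, then edit it
    opensm --create_config /etc/opensm/opensm.conf

    # In the config, give each physically separate IB fabric its own prefix, e.g.
    #   subnet_prefix 0xfe80000000000001
    # then restart opensm pointing at that file:
    opensm -F /etc/opensm/opensm.conf

If both ports are in fact on the same physical fabric, the warning is harmless and can be silenced with --mca btl_openib_warn_default_gid_prefix 0, as the message itself says.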
On Jan 19, 2014, at 10:18 PM, Syed Ahsan Ali wrote:

> I agree with you, and I am still struggling with the subnet ID settings
> because I couldn't find the /var/cache/opensm/opensm.opts file.
>
> Secondly, if OMPI is falling back to TCP then it should be able to find the
> compute nodes, as they are reachable via ping and ssh.
>
> On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> If OMPI finds infiniband support on the node, it will attempt to use it. In
> this case, it would appear you have an incorrectly configured IB adaptor on
> the node, so you get the additional warning about that fact.
>
> OMPI then falls back to look for another transport, in this case TCP.
> However, the TCP transport is unable to create a socket to the remote host.
> The most likely cause is a firewall, so you might want to check that and
> turn it off.
>
> On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
>> Dear All,
>>
>> I am getting infiniband errors while running mpirun applications on the
>> cluster. I get these errors even when I don't include infiniband usage
>> flags in the mpirun command. Please guide.
>>
>> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
>>
>> --------------------------------------------------------------------------
>> [[59183,1],24]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: compute-01-10.private.dns.zone
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: There are more than one active ports on host
>> 'compute-01-15.private.dns.zone', but the
>> default subnet GID prefix was detected on more than one of these
>> ports. If these ports are connected to different physical IB
>> networks, this configuration will fail in Open MPI. This version of
>> Open MPI requires that every physically separate IB subnet that is
>> used between connected MPI processes must have different subnet ID
>> values.
>>
>> Please see this FAQ entry for more details:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>>
>> NOTE: You can turn off this warning by setting the MCA parameter
>> btl_openib_warn_default_gid_prefix to 0.
>> --------------------------------------------------------------------------
>>
>> This is RegCM trunk
>> SVN Revision: tag 4.3.5.6 compiled at: data : Sep 3 2013 time: 05:10:53
>>
>> [pmd.pakmet.com:03309] 15 more processes have sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> [pmd.pakmet.com:03309] 47 more processes have sent help message
>> help-mpi-btl-openib.txt / default subnet prefix
>> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> [compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.108.10 failed: No route to host (113)
>>
>> Ahsan
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014