My email was a mixture of error messages and warnings. The IB card on
compute-01-10 shows as faulty in ibstatus.

ibstat on the other nodes, as well as on compute-01-15, shows dual ports: I
can see the status of both ports there. The firewall is not the problem, I
am sure about that. How can I check for a bad Ethernet port? I can ping
between the master and the compute nodes, and /etc/hosts is fine for name
resolution.
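For reference, a quick way to spot-check both kinds of port on each node
(this assumes the standard infiniband-diags tools and ethtool are
installed; "eth0" below is only a placeholder for whichever interface the
nodes actually use):

    # InfiniBand: a healthy port reports "State: Active" and
    # "Physical state: LinkUp"
    ibstat
    ibstatus

    # Ethernet: look for "Link detected: yes" and the expected Speed/Duplex
    ethtool eth0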
Thank you very much for responding and helping me out.

Ahsan

On Mon, Jan 20, 2014 at 9:27 AM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:

> Is your IB card in compute-01-10.private.dns.zone working?
> Did you check it with ibstat?
>
> Do you have a dual-port IB card in compute-01-15.private.dns.zone?
> Did you connect both ports to the same switch on the same subnet?
>
> TCP "no route to host":
> If it is not a firewall problem, could it be a bad Ethernet port on a
> node, perhaps?
>
> Also, if you use host names in your hostfile, I guess they need to be
> able to resolve the names into IP addresses.
> Check that your /etc/hosts file, DNS server, or whatever you use for
> name resolution, is correct and consistent across the cluster.
>
> On Jan 19, 2014, at 10:18 PM, Syed Ahsan Ali wrote:
>
> > I agree with you and am still struggling with the subnet ID settings,
> > because I couldn't find the /var/cache/opensm/opensm.opts file.
> >
> > Secondly, if OMPI is falling back to TCP, then it should be able to
> > reach the compute nodes, as they are available via ping and ssh.
> >
> > On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > If OMPI finds infiniband support on the node, it will attempt to use
> > it. In this case, it would appear you have an incorrectly configured
> > IB adaptor on the node, so you get the additional warning about that
> > fact.
> >
> > OMPI then falls back to look for another transport, in this case TCP.
> > However, the TCP transport is unable to create a socket to the remote
> > host. The most likely cause is a firewall, so you might want to check
> > that and turn it off.
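One way to test the firewall theory directly is to take the openib BTL out
of the picture and run TCP-only, then look for an active firewall on each
node. A sketch; the iptables service name assumes a RHEL/CentOS-style
cluster:

    # force the TCP, shared-memory, and self transports only:
    mpirun --mca btl tcp,self,sm -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in

    # on each node, check whether a firewall is active:
    service iptables status
    iptables -L -n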
> > On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> >
> >> Dear All
> >>
> >> I am getting infiniband errors while running mpirun applications on
> >> the cluster. I get these errors even when I don't include infiniband
> >> usage flags in the mpirun command. Please guide.
> >>
> >> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
> >>
> >> --------------------------------------------------------------------------
> >> [[59183,1],24]: A high-performance Open MPI point-to-point messaging module
> >> was unable to find any relevant network interfaces:
> >>
> >> Module: OpenFabrics (openib)
> >> Host: compute-01-10.private.dns.zone
> >>
> >> Another transport will be used instead, although this may result in
> >> lower performance.
> >> --------------------------------------------------------------------------
> >> --------------------------------------------------------------------------
> >> WARNING: There are more than one active ports on host
> >> 'compute-01-15.private.dns.zone', but the
> >> default subnet GID prefix was detected on more than one of these
> >> ports. If these ports are connected to different physical IB
> >> networks, this configuration will fail in Open MPI. This version of
> >> Open MPI requires that every physically separate IB subnet that is
> >> used between connected MPI processes must have different subnet ID
> >> values.
> >>
> >> Please see this FAQ entry for more details:
> >>
> >> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >> btl_openib_warn_default_gid_prefix to 0.
> >> --------------------------------------------------------------------------
> >>
> >> This is RegCM trunk
> >> SVN Revision: tag 4.3.5.6 compiled at: data : Sep 3 2013 time: 05:10:53
> >>
> >> [pmd.pakmet.com:03309] 15 more processes have sent help message
> >> help-mpi-btl-base.txt / btl:no-nics
> >> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to 0
> >> to see all help / error messages
> >> [pmd.pakmet.com:03309] 47 more processes have sent help message
> >> help-mpi-btl-openib.txt / default subnet prefix
> >> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> [compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >> [compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> >> Ahsan
> >
> > --
> > Syed Ahsan Ali Bokhari
> > Electronic Engineer (EE)
> >
> > Research & Development Division
> > Pakistan Meteorological Department H-8/4, Islamabad.
> > Phone # off +92518358714
> > Cell # +923155145014
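As to the "No route to host" failures above: one common workaround,
sketched here, is to pin the TCP BTL to an interface that is known to route
between the nodes ("eth0" is only a guess; check the actual interface names
with ifconfig or ip addr first). The default-subnet-GID warning can be
silenced with the parameter named in the NOTE while the fabric is fixed:

    mpirun --mca btl_tcp_if_include eth0 \
           --mca btl_openib_warn_default_gid_prefix 0 \
           -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in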