Dear Pasha,

The ibstatus output is not from two different machines; it is from the same machine. Two InfiniBand ports show up on every node. On all the nodes I checked, one port is always in INIT state and the other is ACTIVE. Now please see below the ibstatus output of the node causing the problem (compute-01-01): one of its ports is DOWN. Could this be the reason for the error? Is it a physical port?

[root@compute-01-01 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
        base lid:        0x29
        sm lid:          0x15
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

On Tue, Jul 22, 2014 at 6:50 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

> Hmm, this does not make sense.
> Your copy-n-paste shows that both machines (00 and 01) have the same
> guid/lid (sort of equivalent of mac address in ethernet world).
> As you can guess these two can not be identical for two different machines
> (unless you moved the card around).
>
> Best,
> Pasha
>
> On Jul 21, 2014, at 11:26 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Yes I had checked running mpirun on all nodes one by one to see the
> problematic one. I had already mentioned that compute-01-01 is causing
> problem, when I remove it from the hostlist mpirun works fine. Here is
> ibstatus of compute-01-01.
>
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
> On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>
> You have to check the ports states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is the 01-00 tries to connect 01-01 and the ports are down on
> 01-01.
>
> You may disable support for infiniband by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear All
>
> I need your help to solve this cluster related issue causing mpirun
> malfunction. I get following warning for some of the nodes and then the
> route failure message comes causing failure to mpirun.
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them). This
> is most certainly not what you wanted. Check your cables, subnet
> manager configuration, etc. The openib BTL will be ignored for this
> job.
> Local host: compute-01-01.private.dns.zone
> --------------------------------------------------------------------------
> SETUP OF THE LM
> INITIALIZATIONS
> INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198] 7 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
>
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
>
> My questions are.
> I don't include flags for running openmpi over infiniband then why it
> still gives warning. If the infiniband ports are not active then it should
> start the job over gigabit ethernet of cluster.
> Why it is unable to find
> the route while the node can be pinged and ssh from other nodes and master
> node as well.
> The ibstatus of the above node (for which I was getting error) shows that
> both ports are up. What is causing error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24835.php
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014
>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24841.php
>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24845.php
>

--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014
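
P.S. For anyone repeating the per-node port check discussed in this thread, here is a minimal sketch of how the ibstatus text could be scanned for ports that are not ACTIVE. It assumes the output format shown in this thread; the function names `parse_ibstatus` and `inactive_ports` are hypothetical helpers of mine, not part of any InfiniBand or Open MPI tool, and the sample text is the compute-01-01 output pasted above.

```python
import re

# Header line that opens each port section in ibstatus output.
PORT_RE = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# Logical state line, e.g. "state: 4: ACTIVE". The lookbehind skips the
# "phys state:" line, which describes the physical link instead.
STATE_RE = re.compile(r"(?<!phys )state:\s*\d+:\s*(\w+)")


def parse_ibstatus(text):
    """Return a list of (device, port, state) tuples found in ibstatus text."""
    headers = list(PORT_RE.finditer(text))
    ports = []
    for i, header in enumerate(headers):
        # A port section runs from its header to the next header (or the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        match = STATE_RE.search(text[header.end():end])
        state = match.group(1) if match else "UNKNOWN"
        ports.append((header.group(1), int(header.group(2)), state))
    return ports


def inactive_ports(text):
    """Return only the ports whose logical state is not ACTIVE."""
    return [p for p in parse_ibstatus(text) if p[2] != "ACTIVE"]


# Sample taken from the compute-01-01 output quoted in this message.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
        base lid:        0x29
        sm lid:          0x15
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
"""

if __name__ == "__main__":
    for device, port, state in inactive_ports(SAMPLE):
        print(f"{device} port {port}: {state}")
```

Feeding it the saved ibstatus output of each node in the hostlist would list every port that is DOWN or still in INIT. Whether a non-ACTIVE port actually matters for a given job is a separate question that this sketch does not try to answer.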