It may seem a silly question, but since you don't show any output from the .21 host: is it also configured and up from an LNet perspective?
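One way to check, as a rough sketch: run the same `lnetctl net show` on mds2 and confirm that its o2ib1 NID is listed with status "up". The `ni_status` helper below is hypothetical (it is not part of Lustre) and assumes the output layout shown in the quoted message, where each `- nid:` line is followed by a `status:` line.

```shell
# Hypothetical helper, not a Lustre tool: print the status ("up"/"down")
# that `lnetctl net show` reports for a single NID.
# Reads the YAML-like output on stdin; prints nothing if the NID is absent.
ni_status() {
    awk -v nid="$1" '
        # Each NI entry starts with a "- nid: <NID>" line.
        $1 == "-" && $2 == "nid:" { found = ($3 == nid) }
        # The first "status:" line after the matching NID is the one we want.
        found && $1 == "status:"  { print $2; exit }
    '
}

# Assumed usage on the remote host (mds2 in this thread):
#   lnetctl net show | ni_status 10.148.0.21@o2ib1
#   lctl list_nids
```

If this prints nothing on mds2, the o2ib1 network is probably not configured there at all, which would be consistent with IPoIB ping working while `lctl ping` fails.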
On Mon, Jan 18, 2021 at 11:39 PM Vinícius Ferrão <[email protected]> wrote:
>
> Hello,
>
> I’ve been scratching my head for three days now, but I cannot do a simple
> ping over InfiniBand using LNet. To be honest, I have no idea what may be
> happening. LNet over TCP (on Ethernet) seems to work fine. The only way an
> LNet ping works is by pinging itself:
>
> [root@mds1 ~]# lctl ping 10.148.0.20@o2ib1
> 12345-0@lo
> 12345-10.24.2.12@tcp1
> 12345-10.148.0.20@o2ib1
>
> Everything else just fails:
>
> [root@mds1 ~]# lctl ping 10.148.0.21@o2ib1
> failed to ping 10.148.0.21@o2ib1: Input/output error
> [root@mds1 ~]# dmesg -T | tail -n 2
> [Tue Jan 19 01:26:01 2021] LNet: 2424:0:(o2iblnd_cb.c:3405:kiblnd_check_conns()) Timed out tx for 10.148.0.21@o2ib1: 5095 seconds
> [Tue Jan 19 01:26:01 2021] LNetError: 2362:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.148.0.21@o2ib1: -125
>
> I can confirm that the IPoIB network is working as expected:
>
> [root@mds1 ~]# ping 10.148.0.21
> PING 10.148.0.21 (10.148.0.21) 56(84) bytes of data.
> 64 bytes from 10.148.0.21: icmp_seq=1 ttl=64 time=2.52 ms
> 64 bytes from 10.148.0.21: icmp_seq=2 ttl=64 time=0.085 ms
>
> The configuration seems to match between the two example machines:
>
> [root@mds1 ~]# ifconfig ib0 | head -n 2
> Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
>         inet 10.148.0.20  netmask 255.255.0.0  broadcast 10.148.255.255
>
> [root@mds2 ~]# ifconfig ib0 | head -n 2
> Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
>         inet 10.148.0.21  netmask 255.255.0.0  broadcast 10.148.255.255
>
> Here’s the output of the network configuration:
>
> [root@mds1 ~]# lnetctl net show
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: tcp1
>       local NI(s):
>         - nid: 10.24.2.12@tcp1
>           status: up
>           interfaces:
>               0: bond0
>     - net type: o2ib1
>       local NI(s):
>         - nid: 10.148.0.20@o2ib1
>           status: up
>           interfaces:
>               0: ib0
>
> The modules seem to be loaded:
>
> [root@mds1 ~]# lsmod | egrep "mlx|mlnx|lnet|rdma|ko2iblnd"
> lnet_selftest         274357  0
> ko2iblnd              238469  1
> lnet                  595358  4 ko2iblnd,lnet_selftest,ksocklnd
> libcfs                415577  4 lnet,ko2iblnd,lnet_selftest,ksocklnd
> rdma_ucm               26931  0
> rdma_cm                64252  2 ko2iblnd,rdma_ucm
> iw_cm                  43918  1 rdma_cm
> ib_cm                  53015  3 rdma_cm,ib_ucm,ib_ipoib
> mlx4_en               142468  0
> mlx4_ib               220791  0
> mlx4_core             361489  2 mlx4_en,mlx4_ib
> mlx5_ib               398193  0
> ib_uverbs             134646  3 mlx5_ib,ib_ucm,rdma_ucm
> ib_core               379808  11 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
> mlx5_core            1113637  1 mlx5_ib
> mlxfw                  18227  1 mlx5_core
> devlink                60067  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
> mlx_compat             47141  15 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_en,mlx4_ib,mlx5_ib,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
> ptp                    23551  3 i40e,mlx4_en,mlx5_core
>
> Both systems were running CentOS 7.9, Lustre 2.12.6 (IB Branch) and Mellanox
> OFED 4.9-2.2.4.0.
>
> The only error messages I’ve found are the dmesg output pasted at the start
> of this message and the I/O error.
>
> Any help is greatly appreciated.
> Thanks,
> Vinícius.
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
