Three things:

1. Can you send your /etc/lnet.conf file?
2. Can you also send /etc/modprobe.d/lnet.conf?
3. Does a systemctl restart lnet produce an error?
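If it is easier, the three items can be gathered in one go. What follows is only a rough sketch, not anything shipped with Lustre: the helper names show_file and restart_lnet and the RESTART_LNET variable are made up for illustration. It assumes the lnet systemd unit is installed under that name. The restart is opt-in, because it needs root and will disturb any Lustre mounts that are still working.

```shell
#!/bin/sh
# Sketch: collect the LNet configuration files and the result of restarting
# the lnet service. show_file, restart_lnet and RESTART_LNET are hypothetical
# names used only in this example.

# Print a file under a header, or say that it is missing.
show_file() {
    echo "===== $1 ====="
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "(not found or not readable)"
    fi
}

# Restart lnet and show what systemd recorded about it (needs root).
restart_lnet() {
    echo "===== systemctl restart lnet ====="
    if systemctl restart lnet; then
        echo "restart succeeded"
    else
        echo "restart failed with exit status $?"
    fi
    systemctl status lnet --no-pager
    journalctl -u lnet -b --no-pager | tail -n 50
}

show_file /etc/lnet.conf
show_file /etc/modprobe.d/lnet.conf

# Only restart when explicitly asked to, e.g. RESTART_LNET=1.
if [ "${RESTART_LNET:-0}" = 1 ]; then
    restart_lnet 2>&1
fi
```

Saved as, say, collect-lnet-info.sh (again, just a placeholder name), it could be run as root with "RESTART_LNET=1 sh collect-lnet-info.sh > lnet-info.txt 2>&1" and the resulting file attached to the reply.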
Sid

On Fri, Apr 30, 2021 at 6:27 AM <[email protected]> wrote:

> Send lustre-discuss mailing list submissions to
>         [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
>         [email protected]
>
> You can reach the person managing the list at
>         [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
> Today's Topics:
>
>    1. Lustre client LNET problem from a novice (Yau Hing Tuen, Bill)
>
>
>
> ---------- Forwarded message ----------
> From: "Yau Hing Tuen, Bill" <[email protected]>
> To: [email protected]
> Cc:
> Bcc:
> Date: Thu, 29 Apr 2021 15:23:51 +0800
> Subject: [lustre-discuss] Lustre client LNET problem from a novice
> Dear All,
>
> Need some advice on the following situation: one of my servers
> (Lustre client only) could no longer connect to the Lustre server.
> Suspecting some problem on the LNET configuration, but I am too new to
> Lustre and does not have more clue on how to troubleshoot it.
>
> Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18
> 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> Lustre version: 2.14.0 (pulled from git)
> Lustre debs built with GCC 9.3.0 on the server.
>
> Modprobe not cleanly complete as static lnet configuration does not work:
> # modprobe -v lustre
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko
> networks="o2ib0(ibp225s0f0)"
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
> insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
> modprobe: ERROR: could not insert 'lustre': Network is down
>
> So resort to try dynamic lnet configuration:
>
> # lctl net up
> LNET configure error 100: Network is down
>
> # lnetctl net show
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>
> # lnetctl net add --net o2ib0 --if ibp225s0f0"
> add:
>     - net:
>           errno: -100
>           descr: "cannot add network: Network is down"
>
> Having these error messages in dmesg after the above "lnetctl net
> add" command
> [265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
> lnet: Ignoring interface enxeeeb676d0232: it's down
> [265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate())
> Skipped 9 previous similar messages
> [265979.238395] LNetError:
> 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size:
> 0x1000000
> [265979.267372] LNetError:
> 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device
> attributes: -22
> [265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup())
> ko2iblnd: Can't initialize device: rc = -22
> [265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
>
> Initial Diagnosis:
> # ip link show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> state UP mode DEFAULT group default qlen 256
>     link/infiniband
> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>
> # ip address show ibp225s0f0
> 41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> state UP group default qlen 256
>     link/infiniband
> 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>     inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::e42:a103:79:991c/64 scope link
>        valid_lft forever preferred_lft forever
>
> # ifconfig ibp225s0f0
> ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
>         inet 10.10.10.3  netmask 255.255.0.0  broadcast 10.10.255.255
>         inet6 fe80::e42:a103:79:991c  prefixlen 64  scopeid 0x20<link>
>         unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00
> txqueuelen 256  (UNSPEC)
>         RX packets 14363998  bytes 1440476592 (1.4 GB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 88  bytes 6648 (6.6 KB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> # lsmod | grep ib
> ko2iblnd              233472  0
> lnet                  552960  3 ko2iblnd,obdclass
> libcfs                487424  3 lnet,ko2iblnd,obdclass
> ib_umad                28672  0
> ib_ipoib              110592  0
> rdma_cm                61440  2 ko2iblnd,rdma_ucm
> ib_cm                  57344  2 rdma_cm,ib_ipoib
> mlx5_ib               307200  0
> mlx_compat             65536  1 ko2iblnd
> ib_uverbs             126976  2 rdma_ucm,mlx5_ib
> ib_core               311296  9
> rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
> mlx5_core             933888  1 mlx5_ib
> libcrc32c              16384  4 nf_conntrack,nf_nat,btrfs,raid456
>
> Also tested ping, ibping and rping, all passed. I have no clue
> what's happening as the server was able to connect to Lustre.
>
> Regards,
> Bill Yau
> University of Hong Kong
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
