Dear All,

    I need some advice on the following situation: one of my servers (Lustre client only) can no longer connect to the Lustre server. I suspect a problem with the LNet configuration, but I am too new to Lustre and have no clue how to troubleshoot it.

Kernel version: Linux 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Lustre version: 2.14.0 (pulled from git)
Lustre debs built with GCC 9.3.0 on the server.

modprobe does not complete cleanly, as the static LNet configuration does not work:
# modprobe -v lustre
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/libcfs.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/net/lnet.ko networks="o2ib0(ibp225s0f0)"
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/obdclass.ko
insmod /lib/modules/5.4.0-65-generic/updates/kernel/fs/ptlrpc.ko
modprobe: ERROR: could not insert 'lustre': Network is down

    So I resorted to trying dynamic LNet configuration:

# lctl net up
LNET configure error 100: Network is down

# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up

# lnetctl net add --net o2ib0 --if ibp225s0f0
add:
    - net:
          errno: -100
          descr: "cannot add network: Network is down"

    These error messages appear in dmesg after the above "lnetctl net add" command:

[265979.237735] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) lnet: Ignoring interface enxeeeb676d0232: it's down
[265979.237738] LNet: 3893180:0:(config.c:1564:lnet_inet_enumerate()) Skipped 9 previous similar messages
[265979.238395] LNetError: 3893180:0:(o2iblnd.c:2655:kiblnd_hdev_get_attr()) Invalid mr size: 0x1000000
[265979.267372] LNetError: 3893180:0:(o2iblnd.c:2869:kiblnd_dev_failover()) Can't get device attributes: -22
[265979.298129] LNetError: 3893180:0:(o2iblnd.c:3353:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
[265980.353643] LNetError: 105-4: Error -100 starting up LNI o2ib
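
For anyone trying to reproduce this, the LNetError lines can be separated from the informational LNet messages with a small filter. This is only a rough sketch: `lnet_errors` is a hypothetical helper name (not part of Lustre), and it assumes the kernel log uses the bracketed timestamp format shown above.

```shell
# Hypothetical helper (not a Lustre tool): read kernel log text on stdin,
# keep only the LNetError lines, and strip the leading [timestamp] so the
# messages from different attempts are easier to compare.
lnet_errors() {
    grep 'LNetError:' | sed -E 's/^\[[ 0-9.]+\] *//'
}

# Typical use on the affected client (reading the kernel log may need root):
#   dmesg | lnet_errors
```

Run against the messages above, this would leave the "Invalid mr size" line first, which seems to be where the failure starts; the later -22 and -100 errors follow from it.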

Initial Diagnosis:
# ip link show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
    link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

# ip address show ibp225s0f0
41: ibp225s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:11:08:fe:80:00:00:00:00:00:00:0c:42:a1:03:00:79:99:1c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.10.3/16 brd 10.10.255.255 scope global ibp225s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::e42:a103:79:991c/64 scope link
       valid_lft forever preferred_lft forever

# ifconfig ibp225s0f0
ibp225s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.10.10.3  netmask 255.255.0.0  broadcast 10.10.255.255
        inet6 fe80::e42:a103:79:991c  prefixlen 64  scopeid 0x20<link>
        unspec 00-00-11-08-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256  (UNSPEC)
        RX packets 14363998  bytes 1440476592 (1.4 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 88  bytes 6648 (6.6 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# lsmod | grep ib
ko2iblnd              233472  0
lnet                  552960  3 ko2iblnd,obdclass
libcfs                487424  3 lnet,ko2iblnd,obdclass
ib_umad                28672  0
ib_ipoib              110592  0
rdma_cm                61440  2 ko2iblnd,rdma_ucm
ib_cm                  57344  2 rdma_cm,ib_ipoib
mlx5_ib               307200  0
mlx_compat             65536  1 ko2iblnd
ib_uverbs             126976  2 rdma_ucm,mlx5_ib
ib_core               311296  9 rdma_cm,ib_ipoib,ko2iblnd,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core             933888  1 mlx5_ib
libcrc32c              16384  4 nf_conntrack,nf_nat,btrfs,raid456

    I also tested ping, ibping and rping; all passed. I have no clue what is happening, as the server was previously able to connect to Lustre.

  Regards,
Bill Yau
University of Hong Kong