Good Afternoon,
I'm experiencing an odd issue with one of my Lustre clients. The system seems
to be having trouble talking to one of the OSS systems, and when it reboots it
somehow mounts Lustre twice. Attempts to use lctl ping from the client to
the OSS return the following error:
~] lctl ping 172.17.0.98@o2ib
failed to ping 172.17.0.98@o2ib: Input/output error
Conventional ping works.
When I try to ping from the OSS side, the lctl ping command hangs indefinitely.
Looking in dmesg I see the following:
[17291774.980764] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping
12345-172.17.0.30@o2ib: late network completion
[17292374.970610] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping
12345-172.17.0.30@o2ib: late network completion
[17292974.961746] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping
12345-172.17.0.30@o2ib: late network completion
[17293602.500931] LNet: 174596:0:(api-ni.c:4116:lnet_ping()) ping
12345-172.17.0.30@o2ib: late network completion
[17294234.941320] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping
12345-172.17.0.30@o2ib: late network completion
A further oddity is that mounting the Lustre area seems to generate a double
mount: when I unmount it by hand I have to do it twice to get it to unmount, and
it shows up twice in /proc/mounts.
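For reference, the duplicate entries can be confirmed with something like the following. This is only a rough sketch: the function name dup_lustre_mounts is made up for illustration, and it assumes the client mount appears in /proc/mounts with filesystem type "lustre" in the third field.

```shell
# Hypothetical helper: list Lustre mounts that appear more than once.
# $1 is the mounts table to read; it defaults to /proc/mounts.
dup_lustre_mounts() {
    awk '$3 == "lustre" { n[$1 " " $2]++ }
         END { for (k in n) if (n[k] > 1) print n[k], k }' "${1:-/proc/mounts}"
}
```

Running dup_lustre_mounts with no argument prints one line per duplicated mount, giving the count followed by the device and the mount point; no output means no Lustre mount is listed twice.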
The client is running the following:
CentOS Linux release 7.3.1611 (Core)
kernel: 3.10.0-514.el7.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64
It has a QDR InfiniBand interface.
The OSS has the following:
CentOS Linux release 7.6.1810 (Core)
kernel: 3.10.0-957.10.1.el7_lustre.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64
It has an FDR InfiniBand interface.
Cables for the client have been swapped, and different QDR switches have been
used.
The client needs to stay at that version of Lustre so it can connect to
another, older Lustre file system.
Thank you,
Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org