Good Afternoon,

I'm experiencing an odd issue with one of my Lustre clients. The system seems
to be having trouble talking to one of the OSS systems, and when it reboots it
is somehow mounting Lustre twice. Attempts to use lctl ping from the client to
the OSS return the following error:

~] lctl ping 172.17.0.98@o2ib
failed to ping 172.17.0.98@o2ib: Input/output error

Conventional ping works.

When I try to ping from the OSS side, the lctl ping command hangs indefinitely.
Looking in dmesg I see the following:
[17291774.980764] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17292374.970610] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17292974.961746] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17293602.500931] LNet: 174596:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion
[17294234.941320] LNet: 86013:0:(api-ni.c:4116:lnet_ping()) ping 12345-172.17.0.30@o2ib: late network completion

A further oddity is that mounting the Lustre area seems to generate a double
mount: when I unmount it by hand I have to do it twice to get it to unmount,
and it shows up twice in /proc/mounts.
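
The duplicate entries can be confirmed with something like the following. This
is a minimal sketch that assumes the standard /proc/mounts column layout
(device, mount point, filesystem type); the function name dup_lustre_mounts is
only illustrative and not part of any Lustre tooling.

```shell
# Count how many times each Lustre mount (device + mount point) appears.
# Reads lines in /proc/mounts format from the file given as $1
# (defaults to /proc/mounts) and prints only entries seen more than once,
# as "<count> <device> <mount point>".
dup_lustre_mounts() {
    awk '$3 == "lustre" { seen[$1 " " $2]++ }
         END { for (m in seen) if (seen[m] > 1) print seen[m], m }' \
        "${1:-/proc/mounts}"
}
```

Called with no argument it reads /proc/mounts directly; on a client showing
the behaviour described above, the Lustre mount should be listed with a count
of 2.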

The client is running the following:
CentOS Linux release 7.3.1611 (Core)
kernel: 3.10.0-514.el7.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64

It has a QDR InfiniBand interface.

The OSS has the following:
CentOS Linux release 7.6.1810 (Core)
kernel: 3.10.0-957.10.1.el7_lustre.x86_64
rpm -qa | grep lustre
lustre-client-2.10.5-1.el7.centos.x86_64
kmod-lustre-client-2.10.5-1.el7.centos.x86_64

It has an FDR InfiniBand interface.

Cables for the client have been swapped, and different QDR switches have been
used.

The client needs to stay at that version of Lustre so it can connect to
another, older Lustre file system.

Thank you,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
