Hello Andreas,

Thanks for your reply and tips.

We found that this case was caused by removing the Lustre modules (uninstalling 
the Lustre RPMs) without unmounting the Lustre instance first. That means the 
Lustre servers were never notified, so they kept trying to recover the 
connection again and again.

The good news is that the LNetError messages stopped after I ran the following 
command to remove the export. Is there a better way to clean up removed 
clients? Perhaps a disconnect at the LNet level?

[root@mds2 ~]# echo "10.67.178.25@tcp" > 
/proc/fs/lustre/mdt/data-MDT0000/exports/clear
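In case it helps anyone else, here is a minimal sketch of that cleanup step. The NID and MDT name are the ones from this thread; the guard around the write is my own addition so the script reports instead of failing when run somewhere other than a Lustre MDS:

```shell
#!/bin/sh
# Sketch: drop a stale export for a departed client NID (run as root on the MDS).
# NID and MDT name are taken from this thread -- adjust for your filesystem.
NID="10.67.178.25@tcp"
EXPORTS="/proc/fs/lustre/mdt/data-MDT0000/exports"

if [ -w "$EXPORTS/clear" ]; then
    # Writing a NID to the "clear" file asks the MDT to drop that client's
    # stale export; its per-NID directory under exports/ should then go away.
    echo "$NID" > "$EXPORTS/clear"
    result="clear requested for $NID"
else
    # Not on a Lustre MDS (or not root): report rather than error out.
    result="skipped: $EXPORTS/clear not writable (not a Lustre MDS?)"
fi
echo "$result"
```

Only use this for clients that are truly gone; a live client whose export is cleared will just reconnect (or be evicted).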

Thank you.

Regards,
Qiulan


________________________________
From: Andreas Dilger <[email protected]>
Sent: Friday, December 8, 2023 6:49 PM
To: Huang, Qiulan <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [lustre-discuss] Lustre server still try to recover the lnet reply 
to the depreciated clients

If you are evicting a client by NID, then use the "nid:" keyword:

    lctl set_param mdt.*.evict_client=nid:10.68.178.25@tcp

Otherwise it is expecting the input to be in the form of a client UUID (to allow
evicting a single export from a client mounting the filesystem multiple times).
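A sketch of the two forms side by side (NID and MDT name taken from this thread; the UUID value is a placeholder you would read from the export's "uuid" file, and the lctl check is my addition so the script stays harmless on machines without Lustre):

```shell
#!/bin/sh
# Sketch: the two accepted evict_client forms described above.

# Form 1: evict every export from a client NID (note the required "nid:" prefix).
NID_PARAM='mdt.*.evict_client=nid:10.68.178.25@tcp'

# Form 2: evict a single export by client UUID, read from e.g.
# /proc/fs/lustre/mdt/data-MDT0000/exports/<nid>/uuid.
UUID_PARAM='mdt.data-MDT0000.evict_client=<client_uuid>'

if command -v lctl >/dev/null 2>&1; then
    lctl set_param "$NID_PARAM"
    mode="ran"
else
    # No lctl on this host: just show what would be run.
    echo "would run: lctl set_param $NID_PARAM"
    echo "or, per-export: lctl set_param $UUID_PARAM"
    mode="dry-run"
fi
```

The UUID form matters when one client mounts the filesystem more than once, since each mount gets its own export.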

That said, the client *should* be evicted by the server automatically, so it 
isn't
clear why this isn't happening.  Possibly this is something at the LNet level
(which unfortunately I don't know much about)?

Cheers, Andreas

> On Dec 6, 2023, at 13:23, Huang, Qiulan via lustre-discuss 
> <[email protected]> wrote:
>
>
>
> Hello all,
>
>
> We removed some clients two weeks ago but we see the Lustre server is still 
> trying to handle the lnet recovery reply to those clients (the error log is 
> posted as below). And they are still listed in the exports dir.
>
>
> I tried to run the following command to evict the clients, but it failed with 
> the error "no exports found":
>
> lctl set_param mdt.*.evict_client=10.68.178.25@tcp
>
>
> Do you know how to clean up these removed clients? Any 
> suggestions would be greatly appreciated.
>
>
>
> For example:
>
> [root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/
> total 0
> -r--r--r-- 1 root root 0 Dec  5 15:41 export
> -r--r--r-- 1 root root 0 Dec  5 15:41 fmd_count
> -r--r--r-- 1 root root 0 Dec  5 15:41 hash
> -rw-r--r-- 1 root root 0 Dec  5 15:41 ldlm_stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 nodemap
> -r--r--r-- 1 root root 0 Dec  5 15:41 open_files
> -r--r--r-- 1 root root 0 Dec  5 15:41 reply_data
> -rw-r--r-- 1 root root 0 Aug 14 10:58 stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 uuid
>
>
>
>
>
> /var/log/messages:Dec  6 12:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous 
> similar messages
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous 
> similar messages
> /var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 
> 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
>
>
> Regards,
> Qiulan
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
