Hi CJ, I don’t know if you ever got an account and ticket opened, but I stumbled upon this change, which sounds like it could be your issue: https://jira.whamcloud.com/browse/LU-16378
commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
Author: Cyril Bordage <[email protected]>
Date:   Sat Dec 10 01:51:16 2022 +0100

    LU-16378 lnet: handles unregister/register events

    When network is restarted, devices are unregistered and then
    registered again. When a device registers using an index that is
    different from the previous one (before network was restarted),
    LNet ignores it. Consequently, this device stays with link in
    fatal state.

    To fix that, we catch unregistering events to clear the saved
    index value, and when a registering event comes, we save the
    new value.

Chris Horn

From: CJ Yin <[email protected]>
Date: Sunday, February 19, 2023 at 12:23 AM
To: Horn, Chris <[email protected]>
Cc: [email protected]
Subject: Re: [lustre-discuss] LNet nid down after some thing changed the NICs

Hi Chris,

Thanks for your help. I have collected the relevant logs according to your hints, but I need an account to open a ticket on Jira. I have sent an email to the administrator at [email protected]; it was the only address I could find on the site. I was wondering if this is the correct way to apply for an account.

Regards,
Chuanjun

Horn, Chris <[email protected]> wrote on Sat, Feb 18, 2023 at 00:52:

If deleting and re-adding it restores the status to up, then this sounds like a bug to me. Can you enable debug tracing, reproduce the issue, and add this information to a ticket? To enable/gather debug:

# lctl set_param debug=+net
<reproduce issue>
# lctl dk > /tmp/dk.log

You can create a ticket at https://jira.whamcloud.com/
Please provide the dk.log with the ticket.
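For anyone hitting the same problem, a quick way to tell whether a given lustre-release checkout already contains this fix is to ask git whether the commit above is an ancestor of the current HEAD. This is only a sketch: the checkout path passed as the first argument is an assumption, and `git merge-base --is-ancestor` is standard git, not a Lustre tool.

```shell
#!/bin/sh
# Sketch: does a lustre-release checkout contain the LU-16378 fix?
# The commit hash is taken from the message above; the checkout path
# (first argument, defaulting to the current directory) is an assumption.
FIX=3c9282a67d73799a03cb1d254275685c1c1e4df2

has_fix() {
    # $1 = path to a lustre-release git checkout; succeeds if the
    # fix commit is an ancestor of HEAD in that checkout.
    git -C "$1" merge-base --is-ancestor "$FIX" HEAD 2>/dev/null
}

if has_fix "${1:-.}"; then
    echo "LU-16378 fix is included"
else
    echo "LU-16378 fix is not included (or not a git checkout)"
fi
```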
Thanks, Chris Horn

From: lustre-discuss <[email protected]> on behalf of 腐朽银 via lustre-discuss <[email protected]>
Date: Friday, February 17, 2023 at 2:53 AM
To: [email protected]
Subject: [lustre-discuss] LNet nid down after some thing changed the NICs

Hi,

I encountered a problem when using the Lustre client on k8s with kubenet. I would be very happy if you could help me. My LNet configuration is:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: up
          interfaces:
              0: eth0

It works, but after I deploy or delete a pod on the node, the nid goes down:

        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0

k8s uses veth pairs, so it adds or deletes network interfaces when deploying or deleting pods, but it does not touch the eth0 NIC. I can fix it by deleting the tcp net with `lnetctl net del` and re-adding it with `lnetctl net add`, but I need to do this every time a pod is scheduled to this node.

My node OS is Ubuntu 18.04 (5.4.0-1101-azure). The Lustre client is built by myself from 2.15.1. Is this expected LNet behavior, or did I get something wrong? I rebuilt and tested it several times and got the same problem.

Regards,
Chuanjun
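Until a build containing the fix is deployed, the manual del/re-add workaround described above can be scripted, e.g. from a periodic job on the node. This is a minimal sketch under stated assumptions: the net is "tcp" on interface eth0 (as in the reporter's config), the status check just greps the YAML that `lnetctl net show` prints, and the lnetctl calls require root and the Lustre client tools.

```shell
#!/bin/sh
# Sketch: detect a down tcp NI and re-create the net, automating the
# manual `lnetctl net del` / `lnetctl net add` workaround from the thread.
# Assumes net "tcp" on interface eth0; requires root and lnetctl.

ni_is_down() {
    # Reads `lnetctl net show` YAML on stdin; succeeds if any NI
    # reports "status: down".
    grep -q 'status: down'
}

recreate_tcp_net() {
    # Drop and re-add the tcp net, which restores the NI status to up.
    lnetctl net del --net tcp
    lnetctl net add --net tcp --if eth0
}

if lnetctl net show --net tcp 2>/dev/null | ni_is_down; then
    recreate_tcp_net
fi
```

This only papers over the symptom; the real fix is the LU-16378 patch, which keeps LNet's saved interface index in sync across unregister/register events.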
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
