Hi CJ,

I don’t know if you ever got an account and opened a ticket, but I stumbled 
upon this change, which sounds like it could be your issue:
https://jira.whamcloud.com/browse/LU-16378

commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
Author: Cyril Bordage <[email protected]>
Date:   Sat Dec 10 01:51:16 2022 +0100

    LU-16378 lnet: handles unregister/register events

    When network is restarted, devices are unregistered and then
    registered again. When a device registers using an index that is
    different from the previous one (before network was restarted), LNet
    ignores it. Consequently, this device stays with link in fatal state.

    To fix that, we catch unregistering events to clear the saved index
    value, and when a registering event comes, we save the new value.

Chris Horn

From: CJ Yin <[email protected]>
Date: Sunday, February 19, 2023 at 12:23 AM
To: Horn, Chris <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [lustre-discuss] LNet nid down after something changed the NICs
Hi Chris,

Thanks for your help. I have collected the relevant logs according to your 
hints, but I need an account to open a ticket on Jira. I have sent an email to 
the administrator at [email protected]. Is this the correct way to apply 
for an account? It is the only address I found on the site.

Regards,
Chuanjun

Horn, Chris <[email protected]> wrote on Saturday, February 18, 2023 at 
00:52:
If deleting and re-adding it restores the status to up then this sounds like a 
bug to me.

Can you enable debug tracing, reproduce the issue, and add this information to 
a ticket?

To enable/gather debug:

# lctl set_param debug=+net
<reproduce issue>
# lctl dk > /tmp/dk.log

You can create a ticket at https://jira.whamcloud.com/

Please provide the dk.log with the ticket.

Thanks,
Chris Horn

From: lustre-discuss <[email protected]> on behalf of 
腐朽银 via lustre-discuss <[email protected]>
Date: Friday, February 17, 2023 at 2:53 AM
To: [email protected]
Subject: [lustre-discuss] LNet nid down after something changed the NICs
Hi,

I encountered a problem when using the Lustre client on Kubernetes (k8s) with 
kubenet. I'd be very happy if you could help me.

My LNet configuration is:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: up
          interfaces:
              0: eth0

It works, but after I deploy or delete a pod on the node, the NID goes down:

        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0

k8s uses veth pairs, so it adds and deletes network interfaces when pods are 
deployed or deleted, but it doesn't touch the eth0 NIC. I can fix it by 
deleting the tcp net with `lnetctl net del` and re-adding it with 
`lnetctl net add`, but I need to do this every time a pod is scheduled to this 
node.
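That workaround can be sketched as a small script (the eth0 interface and tcp 
net names are the ones from my configuration above; the status check simply 
greps the YAML that `lnetctl net show` prints):

```shell
#!/bin/sh
# Sketch of the del/re-add workaround, assuming the eth0 interface
# and tcp net from the configuration above.

# Read `lnetctl net show` YAML on stdin; succeed if any NI is down.
nid_is_down() {
    grep -q 'status: down'
}

# Re-add the tcp net only when LNet actually reports it down.
if lnetctl net show --net tcp 2>/dev/null | nid_is_down; then
    lnetctl net del --net tcp
    lnetctl net add --net tcp --if eth0
fi
```

Running this from cron (or a systemd timer) would at least remove the manual 
step, though it is only a stopgap, not a fix.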

My node OS is Ubuntu 18.04 (5.4.0-1101-azure). The Lustre client is built by 
myself from 2.15.1. Is this expected LNet behavior, or did I get something 
wrong? I rebuilt and tested it several times and got the same problem.

Regards,
Chuanjun
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
