Hi,

We just had a similar issue on 2.15.5. Infiniband clients not reconnecting after a target outage.

Deleting the LNet net and importing the config again solved it without reboot and unmount:

# lnetctl net del --net o2ib
# lnetctl import < /etc/lnet.conf
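For anyone trying this, a fuller sequence might look like the sketch below. The server NID is a placeholder, and this assumes /etc/lnet.conf reflects your intended configuration:

```shell
# Record the current LNet state (NIDs, peers) so you can compare after re-import
lnetctl net show
lnetctl peer show

# Tear down the o2ib net and re-create it from the saved config
lnetctl net del --net o2ib
lnetctl import < /etc/lnet.conf

# Verify the net came back and that a previously unreachable server now responds
lnetctl net show --net o2ib
lctl ping <server-NID>@o2ib
```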

Cheers,
Hans Henrik

On 28/08/2024 18.18, Lixin Liu via lustre-discuss wrote:

We had the same problem after we upgraded the Lustre servers from 2.12.8 to 2.15.3.
Clients were running 2.15.3 on CentOS 7. Random OSTs dropped out frequently on
busy login nodes (almost daily), but less so on compute nodes. The "lctl" command
could not activate the OSTs, and a reboot was the only way to clear the problem.

In June, we upgraded all client OSes to AlmaLinux 9.3 and the Lustre version to 2.15.4 on
both servers and clients (we missed the 2.15.5 release by about two weeks). After the
upgrade, we no longer have this problem.

In our case, I wonder if this was OmniPath related. The servers on AlmaLinux 8 were using
the in-kernel driver, but the CentOS 7 clients were using the driver from the Intel/Cornelis
release. The Alma 9 clients are now also using the in-kernel driver.

Cheers,
Lixin.

*From: *lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Cameron Harr via lustre-discuss <lustre-discuss@lists.lustre.org>
*Reply-To: *Cameron Harr <ha...@llnl.gov>
*Date: *Wednesday, August 28, 2024 at 8:19 AM
*To: *"lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
*Subject: *Re: [lustre-discuss] How to activate an OST on a client ?

There's also an "lctl --device <dev> activate" that I've used in the past, though I don't know what conditions need to be met for it to work.
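For reference, a typical client-side sequence might look like the sketch below. The device names are placeholders, and as noted above, the exact conditions under which activate succeeds aren't established in this thread:

```shell
# Find the device index/name of the inactive OSC on the client
# (look for the OSC entries, e.g. "lustre-OST0004-osc-ffff...")
lctl dl | grep osc

# Activate it by device index or name from the "lctl dl" output
lctl --device <dev> activate

# Or trigger a reconnect instead (the "recover" variant)
lctl --device <dev> recover
```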

On 8/27/24 07:46, Andreas Dilger via lustre-discuss wrote:

    Hi Jan,

    There is "lctl --device XXXX recover" that will trigger a
    reconnect to the named OST device (per "lctl dl" output), but I'm
    not sure if that will help.

    Cheers, Andreas



        On Aug 22, 2024, at 06:36, Haarst, Jan van via lustre-discuss
        <lustre-discuss@lists.lustre.org>
        <mailto:lustre-discuss@lists.lustre.org> wrote:

        Hi,

        The wording of the subject probably doesn't quite cover the
        issue; what we see is this:

        We have a client behind a router (linking tcp to Omnipath)
        that shows an inactive OST (all on 2.15.5).

        Other clients that go through the router do not have this issue.

        One client had the same issue, although it showed a different
        OST as inactive.

        After a reboot, all was well again on that machine.

        The clients can lctl ping the OSSs.

        So although we have a workaround (reboot the client), it would
        be nice to:

         1. Fix the issue without a reboot
         2. Fix the underlying issue.

        It might be unrelated, but we also see another routing issue
        every now and then:

        The router stops routing requests toward a certain OSS, and
        this can be fixed by deleting the OSS's peer_nid from the
        router.

        I am probably missing informative logs, but I’m more than
        happy to try to generate them, if somebody has a pointer to how.

        We are a bit stumped right now.

        With kind regards,

--
        Jan van Haarst

        HPC Administrator

        For Anunna/HPC questions, please use https://support.wur.nl
        (with HPC as a service)

        Present: Monday, Tuesday, Thursday & Friday

        Facilitair Bedrijf, part of Wageningen University &
        Research

        Department of Information Technology

        P.O. Box 59, 6700 AB, Wageningen

        Building 116, Akkermaalsbos 12, 6700 WB, Wageningen

        http://www.wur.nl/nl/Disclaimer.htm

        _______________________________________________
        lustre-discuss mailing list
        lustre-discuss@lists.lustre.org
        http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


