I tried remounting the /home Lustre file system on /mnt in read-only mode. When I try to ls the directory it locks up (although I can escape it), and when I run df I get a completely wrong size (it should be around 192 TB):

10.140.93.42@o2ib:/home  6.0P  4.8P  1.3P  80%  /mnt
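A per-OST breakdown from lfs df should show which OSTs are actually being aggregated into that /mnt total (lfs df and its -h/-i options are standard; the paths are just the ones from this thread):

    lfs df -h /mnt     # per-OST space for whatever is mounted at /mnt
    lfs df -h /home    # same view for the normal /home mount, for comparison
    lfs df -i /mnt     # inode counts, in case an OST has run out of objects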
The ZFS scrub is still running and all disks physically report as OK in the iLO of the two OSS servers... When the scrub finishes later today I will unmount and remount the 4 OSTs and see if the remount changes the status... updates in about 8 hours.

Sid Young


On Tue, Oct 12, 2021 at 8:18 AM Sid Young <[email protected]> wrote:

>
>> 2. Tools to check a lustre (Sid Young)
>> 4. Re: Tools to check a lustre (Dennis Nelson)
>>
>
> My key issue is why /home locks solid when you try to use it but /lustre
> is OK. The backend is ZFS, used to manage the disks presented from the HP
> D8000 JBOD.
> I'm at a loss, after 6 months of 100% operation, as to why this is
> suddenly occurring. If I do repeated "dd" tasks on /lustre it works fine;
> start one on /home and it locks solid.
>
> I have started a ZFS scrub on two of the ZFS pools. At 47 TB each it will
> take most of today to complete, but that should rule out the actual
> storage (which is showing "NORMAL/ONLINE" and no errors).
>
> I'm seeing a lot of these in /var/log/messages:
> kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event
> type 1, status -5, desc ffff89cdf3b9dc00
> A google search returned this:
> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency
>
> Could it be a network issue? The nodes are running the CentOS 7.9
> drivers... the Mellanox one did not seem to make any difference when I
> originally tried it 6 months ago.
>
> Any help appreciated :)
>
> Sid
>
>
>>
>> ---------- Forwarded message ----------
>> From: Sid Young <[email protected]>
>> To: lustre-discuss <[email protected]>
>> Cc:
>> Bcc:
>> Date: Mon, 11 Oct 2021 16:07:56 +1000
>> Subject: [lustre-discuss] Tools to check a lustre
>>
>> I'm having trouble diagnosing where the problem lies in my Lustre
>> installation; clients are 2.12.6, and I have /home and /lustre
>> filesystems using Lustre.
>>
>> /home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs
>> as ACTIVE.
>>
>> The /lustre file system appears fine; I can ls into every directory.
>>
>> When people log into the login node, it appears to lock up. I have shut
>> down everything and remounted the OSTs and MDTs etc. in order with no
>> errors reported, but I'm getting the lockup issue soon after a few people
>> log in.
>> The backend network is 100G Ethernet using ConnectX-5 cards and the OS is
>> CentOS 7.9; everything was installed as RPMs and updates are disabled in
>> yum.conf.
>>
>> Two questions to start with:
>> Is there a command line tool to check each OST individually?
>> Apart from /var/log/messages, is there a Lustre-specific log I can
>> monitor on the login node to see errors when I hit /home...
>>
>> Sid Young
>>
>>
>> ---------- Forwarded message ----------
>> From: Dennis Nelson <[email protected]>
>> To: Sid Young <[email protected]>
>> Date: Mon, 11 Oct 2021 12:20:25 +0000
>> Subject: Re: [lustre-discuss] Tools to check a lustre
>>
>> Have you tried lfs check servers on the login node?
>>
>
> Yes - one of the first things I did, and this is what it always reports:
>
> ]# lfs check servers
> home-OST0000-osc-ffff89adb7e5e000 active.
> home-OST0001-osc-ffff89adb7e5e000 active.
> home-OST0002-osc-ffff89adb7e5e000 active.
> home-OST0003-osc-ffff89adb7e5e000 active.
> lustre-OST0000-osc-ffff89cdd14a2000 active.
> lustre-OST0001-osc-ffff89cdd14a2000 active.
> lustre-OST0002-osc-ffff89cdd14a2000 active.
> lustre-OST0003-osc-ffff89cdd14a2000 active.
> lustre-OST0004-osc-ffff89cdd14a2000 active.
> lustre-OST0005-osc-ffff89cdd14a2000 active.
> home-MDT0000-mdc-ffff89adb7e5e000 active.
> lustre-MDT0000-mdc-ffff89cdd14a2000 active.
> [root@tri-minihub-01 ~]#
>
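Since lfs check servers reports everything active while bulk I/O fails with status -5 (-EIO) in client_bulk_callback, the next things on my list are LNet connectivity and the per-OST import state from the client, plus the scrub status on the OSS. Roughly (the NID is the one from the df output above, and the osc parameter names are the standard client-side ones):

    # on the client / login node
    lctl list_nids                      # NIDs configured on this client
    lctl ping 10.140.93.42@o2ib         # LNet-level ping to the server NID
    lctl get_param osc.home-*.import    # connection state and recent failures per /home OST
    lctl get_param osc.home-*.state     # import state history per /home OST

    # on the OSS servers
    zpool status -v                     # scrub progress and per-vdev read/write/cksum error counters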
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
