Just to follow up on this thread: upgrading to Lustre 2.12.9 seems to have resolved the issue.
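
For anyone checking their own nodes before or after an upgrade, the running version can be confirmed with something like the following (assuming lctl is on the PATH and the Lustre modules are loaded; the rpm query only applies to RPM-based systems such as CentOS):

  # version reported by the loaded Lustre modules
  lctl get_param version

  # installed Lustre packages (RPM-based distros)
  rpm -qa | grep -i lustre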
> On Aug 20, 2024, at 7:37 AM, Makia Minich <ma...@systemfabricworks.com> wrote:
>
> Wondering if others may have seen something or know of a remedy.
>
> Late last week we had a room lose power, which meant the filesystem took a hard crash. When power was restored it looked like the JBODs made it through, and all of the LUNs appeared to be healthy (after a little bit of rebuilding). The servers were also able to see the LUNs successfully, so it looked like things were going better than anticipated.
>
> The system (both servers and clients) is CentOS 7.9 with Lustre 2.12.7.
>
> Bringing up the filesystem is when things went sideways. The MGT mounted with no issue (standard recovery messages), and the MDT also mounted. We proceeded to mount the OSTs, when we noticed that the MDS had suddenly rebooted with a kernel panic. Looking at dmesg (after it was brought back up) we found the following messages:
>
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message
>
> After a few attempts (hoping it was a fluke), the same message would cause an assert; we noticed this occurred with two specific OSTs. Leaving those two OSTs down, we were able to bring up the rest of the filesystem successfully, but when either of them is mounted something appears to be triggered and the MDT crashes. There are no messages on the OSS other than losing the connection to the MGS (due to the crash).
>
> We've tried clearing the updatelog and changelog with no change in behavior. So, any other ideas would be appreciated.
>
> Below is the full dmesg from the start of mounting the MGT:
>
> [ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: (null)
> [ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)
> [ 6845.003021] Lustre: Skipped 1 previous similar message
> [ 6853.385804] Lustre: MGS: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)
> [ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message
> [ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled, recovery window 300-900
> [ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the first client to connect
> [ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)
> [ 6872.207536] Lustre: Skipped 1 previous similar message
> [ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect
> [ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)
> [ 6908.270429] Lustre: Skipped 4 previous similar messages
> [ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 clients 7 recovered and 0 were evicted.
> [ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
> [ 6984.953509] Lustre: MGS: Connection restored to 83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)
> [ 6984.953517] Lustre: Skipped 2 previous similar messages
> [ 7115.328345] Lustre: MGS: Connection restored to 1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)
> [ 7115.328352] Lustre: Skipped 16 previous similar messages
> [ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0]record for index 0/1
> [ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed: lustre01-OST0033-osc-MDT0000: 1 0 !empty
> [ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG
> [ 7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021
> [ 7201.703865] Call Trace:
> [ 7201.703877] [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
> [ 7201.703896] [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
> [ 7201.703909] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.703926] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.703937] [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39
> [ 7201.703945] [<ffffffffffffffff>] 0xffffffffffffffff
> [ 7201.703984] Kernel panic - not syncing: LBUG
> [ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.21.1.el7.x86_64 #1
> [ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR, BIOS 2.9.4 11/06/2020
> [ 7201.714311] Call Trace:
> [ 7201.716865] [<ffffffffbd98305a>] dump_stack+0x19/0x1b
> [ 7201.719418] [<ffffffffbd97c5b2>] panic+0xe8/0x21f
> [ 7201.721938] [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
> [ 7201.724425] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.726873] [<ffffffffbd98899f>] ? __schedule+0x3af/0x860
> [ 7201.729286] [<ffffffffc1c8da50>] ? osp_sync_process_committed+0x700/0x700 [osp]
> [ 7201.731672] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.734016] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
> [ 7201.736329] [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21
> [ 7201.738617] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
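
For anyone who hits the same LBUG: the FID in the "invalid length 0 in llog" message identifies the OSP sync llog that the MDT trips over for that OST. Below is a rough, hedged sketch of how those records can be examined from the MDS; the llog subcommand syntax varies between lctl versions (older builds may only accept a log name rather than a FID), so check lctl --help on your system first:

  # list the llog catalogs known to the MDT
  lctl --device lustre01-MDT0000 llog_catlist

  # dump header info and records for the log named in the error
  # ([0x52ab:0x1:0x0] is the FID reported for OST0032 above)
  lctl --device lustre01-MDT0000 llog_info '[0x52ab:0x1:0x0]'
  lctl --device lustre01-MDT0000 llog_print '[0x52ab:0x1:0x0]'

None of this repairs the bad records by itself; for us it was the upgrade to 2.12.9 that made the issue go away.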