Just to follow up on this thread: upgrading to Lustre 2.12.9 appears to 
resolve this issue.
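For anyone who hits this before upgrading, the suspect llog can at least be inspected from the MDS with `lctl`. A minimal sketch follows; the device name and the llog FID are taken from the dmesg in the quoted report, and exact llog naming can vary by Lustre version, so verify both against your own logs first:

```shell
# Sketch only: inspect llogs on the MDT (run on the MDS as root).
# "lustre01-MDT0000" and the FID below come from the dmesg in this thread;
# substitute the values from your own error messages.

# List the catalog llogs known to the MDT device:
lctl --device lustre01-MDT0000 llog_catlist

# Dump the records of the specific llog named in the error line
# ("invalid length 0 in llog [0x52ab:0x1:0x0]"):
lctl --device lustre01-MDT0000 llog_print '[0x52ab:0x1:0x0]'
```

A zero-length record in the dump would corroborate on-disk llog corruption from the power loss, which is consistent with the -22 (EINVAL) return from osp_sync_process_queues in the trace below.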

> On Aug 20, 2024, at 7:37 AM, Makia Minich <ma...@systemfabricworks.com> wrote:
> 
> Wondering if others may have seen something or know of a remedy.
> 
> Late last week we had a room lose power, which meant the filesystem took a 
> hard crash. When power was restored it looked like the JBODs made it through, 
> and all of the LUNs appear to be healthy (after a little bit of rebuilding). 
> The servers were also able to successfully see the LUNs, so it all looked like 
> things were going better than anticipated.
> 
> The system (both server and clients) is CentOS 7.9 with Lustre 2.12.7.
> 
> Bringing up the filesystem is when things went sideways. The MGT mounted with 
> no issue (standard recovery messages), and the MDT also mounted. We were 
> proceeding to mount the OSTs when we noticed that the MDS suddenly rebooted 
> with a kernel panic. Looking at dmesg (after it was brought back up) we found 
> the following message:
> 
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) 
> lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record 
> for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 
> previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) 
> lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues 
> failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) 
> Skipped 1 previous similar message
> 
> After a few attempts (hoping it was a fluke), the same message would cause an 
> assert; we noticed this occurred with two specific OSTs. Leaving those two 
> OSTs down, we were able to bring up the rest of the filesystem successfully, 
> but when either of them is mounted, something appears to be triggered and the 
> MDS crashes. Looking at the OSS, there are no messages there other than losing 
> the connection to the MGS (due to the crash).
> 
> We've tried clearing the updatelog and changelog with no change in behavior. 
> So, any other ideas would be appreciated.
> 
> Below is the full dmesg from the start of mounting the MGT:
> 
> [ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data 
> mode. Opts: (null)
> [ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data 
> mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 
> (at 0@lo)
> [ 6845.003021] Lustre: Skipped 1 previous similar message
> [ 6853.385804] Lustre: MGS: Connection restored to 
> b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)
> [ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data 
> mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) 
> lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record 
> for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 
> previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) 
> lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues 
> failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) 
> Skipped 1 previous similar message
> [ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled, 
> recovery window 300-900
> [ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the 
> first client to connect
> [ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to 
> MGC192.168.240.7@tcp1_0 (at 0@lo)
> [ 6872.207536] Lustre: Skipped 1 previous similar message
> [ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 
> 5:00, or until 7 clients reconnect
> [ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to 
> b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)
> [ 6908.270429] Lustre: Skipped 4 previous similar messages
> [ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 
> clients 7 recovered and 0 were evicted.
> [ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 79000
> [ 6984.953509] Lustre: MGS: Connection restored to 
> 83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)
> [ 6984.953517] Lustre: Skipped 2 previous similar messages
> [ 7115.328345] Lustre: MGS: Connection restored to 
> 1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)
> [ 7115.328352] Lustre: Skipped 16 previous similar messages
> [ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) 
> lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0]record 
> for index 0/1
> [ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1 
> previous similar message
> [ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread()) 
> lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues 
> failed: -22
> [ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) 
> ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed: 
> lustre01-OST0033-osc-MDT0000: 1 0 !empty
> [ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG
> [ 7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1 
> SMP Tue Mar 16 18:28:22 UTC 2021
> [ 7201.703865] Call Trace:
> [ 7201.703877]  [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
> [ 7201.703896]  [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
> [ 7201.703909]  [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.703926]  [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.703937]  [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39
> [ 7201.703945]  [<ffffffffffffffff>] 0xffffffffffffffff
> [ 7201.703984] Kernel panic - not syncing: LBUG
> [ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P 
>           OE  ------------   3.10.0-1160.21.1.el7.x86_64 #1
> [ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR, 
> BIOS 2.9.4 11/06/2020
> [ 7201.714311] Call Trace:
> [ 7201.716865]  [<ffffffffbd98305a>] dump_stack+0x19/0x1b
> [ 7201.719418]  [<ffffffffbd97c5b2>] panic+0xe8/0x21f
> [ 7201.721938]  [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
> [ 7201.724425]  [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.726873]  [<ffffffffbd98899f>] ? __schedule+0x3af/0x860
> [ 7201.729286]  [<ffffffffc1c8da50>] ? osp_sync_process_committed+0x700/0x700 
> [osp]
> [ 7201.731672]  [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.734016]  [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
> [ 7201.736329]  [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21
> [ 7201.738617]  [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
> 
> 

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
