You are likely hitting this bug: https://jira.whamcloud.com/browse/LU-15207
It is fixed in the (not yet released) 2.16.0.
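
If you want to confirm the exact version running on the MDS before planning an upgrade, something along these lines should work (standard lctl tooling; a minimal sketch, and the procfs path can vary between releases):

    # Print the Lustre version known to the kernel modules
    lctl get_param version

    # Older releases expose the same information via procfs
    cat /proc/fs/lustre/version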

Aurélien
________________________________
From: lustre-discuss <[email protected]> on behalf of
Lixin Liu via lustre-discuss <[email protected]>
Sent: Wednesday, November 29, 2023 17:18
To: lustre-discuss <[email protected]>
Objet : [lustre-discuss] MDS crashes, lustre version 2.15.3

Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and
the OSTs are using ZFS.
The system performed well at the beginning, but recently we have been seeing
frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We have also experienced unexpected OST drops (OSTs switching to inactive mode)
on login nodes, and the only way we have found to bring an OST back is to
reboot the client.
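
For reference, this is the kind of client-side check that shows the state
(standard lfs/lctl invocations; "cedar" and the OST index below are
placeholders for our fsname, adjust as needed):

    # Summarize OST status as seen from this client
    lfs check osts

    # Inspect the osc 'active' flag and import state for one OST
    lctl get_param osc.cedar-OST0000-osc-*.active
    lctl get_param osc.cedar-OST0000-osc-*.import

Setting the parameter back with 'lctl set_param
osc.cedar-OST0000-osc-*.active=1' would be the obvious non-reboot step, but so
far only a client reboot has reliably restored the OST for us.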

Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
