You are likely hitting this bug: https://jira.whamcloud.com/browse/LU-15207, which is fixed in the (not yet released) 2.16.0.
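As a quick sanity check, you can compare the running Lustre version against 2.16.0 (the release carrying the LU-15207 fix). Below is a minimal sketch: on a live node the version would come from `lctl get_param -n version`; here a sample value is hard-coded so the snippet runs standalone.

```shell
# True (exit 0) if $1 sorts strictly before $2 in version order.
version_lt() {
    [ "$1" != "$2" ] && \
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Sample value; on a real node: current="$(lctl get_param -n version)"
current="2.15.3"

if version_lt "$current" "2.16.0"; then
    echo "affected: $current predates the LU-15207 fix"
else
    echo "ok: $current includes the fix"
fi
```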
Aurélien

________________________________
From: lustre-discuss <[email protected]> on behalf of Lixin Liu via lustre-discuss <[email protected]>
Sent: Wednesday, 29 November 2023 17:18
To: lustre-discuss <[email protected]>
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3

External email: Use caution opening links or attachments

Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and the OSTs are using ZFS. The system seemed to perform well at the beginning, but recently we have seen frequent MDS crashes. vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name: /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We also experienced an unexpected OST drop (change to inactive mode) on login nodes, and the only way to bring the OST back is to reboot the client. Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
