On Wed, 2025-02-12 at 21:26 +0000, Walter, Eric wrote:
>
> Hello,
>
> We have recently upgraded a cluster to Rocky 9.5 (kernel version
> 5.14.0-503.22.1.el9_5.x86_64). After upgrading to the lustre-2.15.6
> client, we are seeing repeated kernel oops / crashes when jobs are
> reading/writing to both of our Lustre filesystems after about 3-4
> hours of running. It is repeatable and results in a kernel oops
> referencing the ldlm process of Lustre. It is just our clients that
> are on Rocky 9.5; no other systems are having issues.

The first hit for ll_prune_negative_children on Jira leads to this
ticket, which links to the fix: https://jira.whamcloud.com/browse/LU-18085

> We would normally mount with o2ib (we upgraded to Mellanox driver
> version 24.10-1.1.4.0 for Rocky 9.5); however, our tests still result
> in the same ldlm kernel oops when mounted over tcp.
>
> The oops-related output from vmcore-dmesg.txt is posted below.
>
> I have looked for various known issues with 2.15.6 and can't find
> anyone else reporting this. Any ideas on what to do besides
> downgrading to Rocky 9.4? Has anyone else seen such a problem with
> 9.5 and clients using v2.15.6?
>
> [ 6267.182434] BUG: kernel NULL pointer dereference, address: 0000000000000004
> [ 6267.182441] #PF: supervisor write access in kernel mode
> [ 6267.182443] #PF: error_code(0x0002) - not-present page
> [ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0
> [ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI
> [ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G OE ------- --- 5.14.0-503.22.1.el9_5.x86_64 #1
> [ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS 1.5.8 07/21/2023
> [ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
> [ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d 48 83 7d 30 00 75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80 00 00 00 <80> 48 04 01 8b 45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8 ac 21 fb
> [ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246
> [ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX: 0000000000000000
> [ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI: ff28db0fc22c9860
> [ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09: ffffffa1dd3f0088
> [ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12: ff28db3ed37d9320
> [ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15: ff28db0fc22c9860
> [ 6267.182491] FS:  0000000000000000(0000) GS:ff28db3dfebc0000(0000) knlGS:0000000000000000
> [ 6267.182493] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4: 0000000000771ef0
> [ 6267.182495] PKRU: 55555554
> [ 6267.182495] Call Trace:
> [ 6267.182499]  <TASK>
> [ 6267.182500]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182506]  ? show_trace_log_lvl+0x26e/0x2df
> [ 6267.182513]  ? show_trace_log_lvl+0x26e/0x2df
> [ 6267.182517]  ? ll_lock_cancel_bits+0x73a/0x760 [lustre]
> [ 6267.182535]  ? __die_body.cold+0x8/0xd
> [ 6267.182538]  ? page_fault_oops+0x134/0x170
> [ 6267.182542]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182545]  ? exc_page_fault+0x62/0x150
> [ 6267.182549]  ? asm_exc_page_fault+0x22/0x30
> [ 6267.182553]  ? ll_prune_negative_children+0x9d/0x250 [lustre]
> [ 6267.182570]  ll_lock_cancel_bits+0x73a/0x760 [lustre]
> [ 6267.182588]  ll_md_blocking_ast+0x1a3/0x300 [lustre]
> [ 6267.182606]  ldlm_cancel_callback+0x7a/0x290 [ptlrpc]
> [ 6267.182639]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182642]  ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]
> [ 6267.182674]  ldlm_cli_cancel+0x271/0x520 [ptlrpc]
> [ 6267.182705]  ll_md_blocking_ast+0x1cd/0x300 [lustre]
> [ 6267.182722]  ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]
> [ 6267.182753]  ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]
> [ 6267.182782]  ldlm_bl_thread_main+0x533/0x610 [ptlrpc]
> [ 6267.182811]  ? __pfx_autoremove_wake_function+0x10/0x10
> [ 6267.182817]  ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
> [ 6267.182846]  kthread+0xdd/0x100
> [ 6267.182851]  ? __pfx_kthread+0x10/0x10
> [ 6267.182853]  ret_from_fork+0x29/0x50
> [ 6267.182859]  </TASK>
> [ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE)
>   osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4
>   dns_resolver libcfs(OE) nfs lockd grace fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE)
>   ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc binfmt_misc vfat fat amd_atl intel_rapl_msr
>   ipmi_ssif intel_rapl_common amd64_edac dell_wmi edac_mce_amd ledtrig_audio sparse_keymap
>   rfkill kvm_amd mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm ipmi_si
>   dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor rapl wmi_bmof pcspkr
>   i2c_piix4 ipmi_msghandler k10temp acpi_power_meter fuse drm xfs libcrc32c mlx5_ib(OE)
>   macsec ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) sd_mod t10_pi psample ahci
>   mlxdevm(OE) sg libahci mlx_compat(OE) crct10dif_pclmul crc32_pclmul crc32c_intel tls
>   libata ghash_clmulni_intel tg3 ccp megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror
>   dm_region_hash dm_log dm_mod xpmem(OE)
> [ 6267.182922] CR2: 0000000000000004
>
> Thanks for any help you can provide.
>
> Eric
>
> --
> Eric J. Walter
> Executive Director, Research Computing
> Information Technology
>
> William & Mary
> Office: 757-221-1886

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org