Not a problem. Once you confirm, I will submit the patches to the generic Ubuntu kernel, and they will land in the gcp/aws/azure kernels via the regular SRU update route. I will update the ticket once I have sent the patches out for review.
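
As a triage aid for the dump quoted below, here is a minimal sketch that groups the tasks in `crash` `bt` output by the lock address they are spinning on. It is not part of the patches. The helper name `spinners_by_lock` and the trimmed `SAMPLE` text are hypothetical, and the sketch assumes RDI still holds the lock pointer (the first argument of native_queued_spin_lock_slowpath on x86-64) at the sampled instruction. Resolving that address to a symbol such as css_set_lock would still need to be done in the crash session, for example with `sym`.

```python
import re
from collections import defaultdict

# Header line that starts each task in "bt" output.
TASK_RE = re.compile(
    r'PID:\s*(\d+)\s+TASK:\s*\w+\s+CPU:\s*(\d+)\s+COMMAND:\s*"([^"]*)"'
)
# Marker showing the NMI interrupted the spinlock slow path.
SPIN_RE = re.compile(r'\[exception RIP: native_queued_spin_lock_slowpath\+\d+\]')
# First RDI after the marker is the kernel-mode register dump.
RDI_RE = re.compile(r'RDI:\s*([0-9a-f]{16})')


def spinners_by_lock(bt_text):
    """Map lock address -> list of (pid, cpu, command) spinning on it."""
    result = defaultdict(list)
    tasks = list(TASK_RE.finditer(bt_text))
    for i, task in enumerate(tasks):
        # A task's section runs until the next task header (or end of text).
        end = tasks[i + 1].start() if i + 1 < len(tasks) else len(bt_text)
        section = bt_text[task.end():end]
        spin = SPIN_RE.search(section)
        if spin is None:
            continue  # this task was not interrupted in the slow path
        rdi = RDI_RE.search(section, spin.end())
        if rdi is not None:
            pid, cpu, command = task.groups()
            result[rdi.group(1)].append((int(pid), int(cpu), command))
    return dict(result)


# Trimmed excerpt of the two tasks from the bug report (assumed layout).
SAMPLE = '''
PID: 21079 TASK: ffff91fdcd1dc000 CPU: 0 COMMAND: "sh"
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: ffffffffadd40eff RSP: ffffa1f68589fc60 RFLAGS: 00000002
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
PID: 20304 TASK: ffff91fb05440000 CPU: 1 COMMAND: "Writer:Driver>C"
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: ffffffffadd40eff RSP: ffffa1f6853afd00 RFLAGS: 00000002
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
'''

if __name__ == "__main__":
    for lock, waiters in spinners_by_lock(SAMPLE).items():
        print(lock, waiters)
```

If every CPU in the dump maps to the same address, as both tasks in the report do, that is consistent with all CPUs waiting on one lock whose holder is not running on any of them.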
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089318

Title:
  kernel hard lockup 5.15.0-1072-aws

Status in linux package in Ubuntu:
  Triaged
Status in linux-aws-5.15 package in Ubuntu:
  In Progress
Status in linux source package in Focal:
  New
Status in linux-aws-5.15 source package in Focal:
  New

Bug description:
  Hi friends,

  We hit a kernel hard lockup where all CPUs are stuck acquiring an
  already-locked spinlock (css_set_lock) within the cgroup subsystem.
  Below are the call stacks from a memory dump of a two-core system
  taken on Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs
  on Azure and GCP too. It happens non-deterministically (in less than
  1% of cases) and can occur at any point during VM execution. We
  suspect it’s a deadlock triggered by some race condition, but we
  don’t know for sure.

  ```
  PID: 21079  TASK: ffff91fdcd1dc000  CPU: 0  COMMAND: "sh"
   #0 [fffffe7127850cb8] machine_kexec at ffffffffadc92680
   #1 [fffffe7127850d18] __crash_kexec at ffffffffadda0b9f
   #2 [fffffe7127850de0] panic at ffffffffae8f56be
   #3 [fffffe7127850e70] unknown_nmi_error.cold at ffffffffae8eb4c8
   #4 [fffffe7127850e90] default_do_nmi at ffffffffae99c639
   #5 [fffffe7127850eb8] exc_nmi at ffffffffae99c7db
   #6 [fffffe7127850ef0] end_repeat_nmi at ffffffffaea017f3
      [exception RIP: native_queued_spin_lock_slowpath+63]
      RIP: ffffffffadd40eff  RSP: ffffa1f68589fc60  RFLAGS: 00000002  (interrupt disabled!!)
      RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fb597c8980
      RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
      RBP: ffffa1f68589fc88   R8: 0000000000005259   R9: 00000000597c8980
      R10: 0000000000000000  R11: 0000000000000000  R12: ffffa1f68589fdf8
      R13: ffff91fdcd1d8000  R14: 0000000000004100  R15: ffff91fdcd1d8000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #7 [ffffa1f68589fc60] native_queued_spin_lock_slowpath at ffffffffadd40eff
   #8 [ffffa1f68589fc90] _raw_spin_lock_irq at ffffffffae9af19a
   #9 [ffffa1f68589fca0] cgroup_can_fork at ffffffffaddb0de8
  #10 [ffffa1f68589fce8] copy_process at ffffffffadcc1938
  #11 [ffffa1f68589fcf0] filemap_map_pages at ffffffffadeb68db
  #12 [ffffa1f68589fdf0] __x64_sys_vfork at ffffffffadcc2a20
  #13 [ffffa1f68589fe70] x64_sys_call at ffffffffadc068a9
  #14 [ffffa1f68589fe80] do_syscall_64 at ffffffffae99a9e4
  #15 [ffffa1f68589fec0] exit_to_user_mode_prepare at ffffffffadd725ad
  #16 [ffffa1f68589ff00] irqentry_exit_to_user_mode at ffffffffae99f43e
  #17 [ffffa1f68589ff10] irqentry_exit at ffffffffae99f46d
  #18 [ffffa1f68589ff18] clear_bhb_loop at ffffffffaea018c5
  #19 [ffffa1f68589ff28] clear_bhb_loop at ffffffffaea018c5
  #20 [ffffa1f68589ff38] clear_bhb_loop at ffffffffaea018c5
  #21 [ffffa1f68589ff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
      RIP: 00007fddfa4cebcc  RSP: 00007fffaa741990  RFLAGS: 00000202
      RAX: ffffffffffffffda  RBX: 000055ea66750428  RCX: 00007fddfa4cebcc
      RDX: 0000000000000000  RSI: 00007fffaa7419c0  RDI: 000055ea663c8866
      RBP: 0000000000000003   R8: 00007fffaa7419c0   R9: 000055ea667505f0
      R10: 0000000000000008  R11: 0000000000000202  R12: 00007fffaa7419c0
      R13: 00007fffaa741ae0  R14: 0000000000000000  R15: 000055ea663de810
      ORIG_RAX: 000000000000003a  CS: 0033  SS: 002b

  PID: 20304  TASK: ffff91fb05440000  CPU: 1  COMMAND: "Writer:Driver>C"
   #0 [fffffe6c293d3e10] crash_nmi_callback at ffffffffadc81ec0
   #1 [fffffe6c293d3e48] nmi_handle at ffffffffadc49b03
   #2 [fffffe6c293d3e90] default_do_nmi at ffffffffae99c5a5
   #3 [fffffe6c293d3eb8] exc_nmi at ffffffffae99c7db
   #4 [fffffe6c293d3ef0] end_repeat_nmi at ffffffffaea017f3
      [exception RIP: native_queued_spin_lock_slowpath+63]
      RIP: ffffffffadd40eff  RSP: ffffa1f6853afd00  RFLAGS: 00000002  (interrupt disabled!!)
      RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fa1d0aee00
      RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
      RBP: ffffa1f6853afd28   R8: 000000000000525a   R9: 000000001d0aee00
      R10: 0000000000000000  R11: 0000000000000000  R12: ffffa1f6853afe98
      R13: ffff91fd8eeea000  R14: 00000000003d0f00  R15: ffff91fd8eeea000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #5 [ffffa1f6853afd00] native_queued_spin_lock_slowpath at ffffffffadd40eff
   #6 [ffffa1f6853afd30] _raw_spin_lock_irq at ffffffffae9af19a
   #7 [ffffa1f6853afd40] cgroup_can_fork at ffffffffaddb0de8
   #8 [ffffa1f6853afd88] copy_process at ffffffffadcc1938
   #9 [ffffa1f6853afe20] kernel_clone at ffffffffadcc262d
  #10 [ffffa1f6853afe90] __do_sys_clone at ffffffffadcc2a9d
  #11 [ffffa1f6853aff10] __x64_sys_clone at ffffffffadcc2ae5
  #12 [ffffa1f6853aff20] x64_sys_call at ffffffffadc05579
  #13 [ffffa1f6853aff30] do_syscall_64 at ffffffffae99a9e4
  #14 [ffffa1f6853aff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
      RIP: 00007f0d8bcac9f6  RSP: 00007f0cfabfcc38  RFLAGS: 00000206
      RAX: ffffffffffffffda  RBX: 00007f0cfabfcc90  RCX: 00007f0d8bcac9f6
      RDX: 00007f0ced3ff910  RSI: 00007f0ced3feef0  RDI: 00000000003d0f00
      RBP: ffffffffffffff80   R8: 00007f0ced3ff640   R9: 00007f0ced3ff640
      R10: 00007f0ced3ff910  R11: 0000000000000206  R12: 00007f0ced3ff640
      R13: 0000000000000016  R14: 00007f0d8bc1b7d0  R15: 00007f0cfabfcdf0
      ORIG_RAX: 0000000000000038  CS: 0033  SS: 002b
  ```

  Environment
  ```
  $ uname -a
  Linux ip-172-31-16-171 5.15.0-1072-aws #78~20.04.1-Ubuntu SMP Wed Oct 9 15:30:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  $ cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 106
  model name      : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
  stepping        : 6
  microcode       : 0xd0003e8
  cpu MHz         : 2900.036
  cache size      : 55296 KB
  physical id     : 0
  siblings        : 8
  core id         : 0
  cpu cores       : 4
  apicid          : 0
  initial apicid  : 0
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 27
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
  bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs mmio_stale_data eibrs_pbrsb gds bhi
  bogomips        : 5800.07
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 46 bits physical, 48 bits virtual
  ```

  We see this very infrequently, but have experienced it on a variety
  of instance types: r6i.large, r6i.xlarge, and r6i.2xlarge at least.

  Thanks!

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp