[Kernel-packages] [Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

Philip Cox Tue, 17 Dec 2024 10:58:35 -0800

Not a problem.  Once you confirm, I will submit the patches to the
generic Ubuntu kernel, and they will land in the gcp/aws/azure kernels
via the regular SRU update route.  I will update the ticket once I've
sent the patches out for review.


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089318

Title:
  kernel hard lockup 5.15.0-1072-aws

Status in linux package in Ubuntu:
  Triaged
Status in linux-aws-5.15 package in Ubuntu:
  In Progress
Status in linux source package in Focal:
  New
Status in linux-aws-5.15 source package in Focal:
  New

Bug description:
  Hi friends,

  We hit a kernel hard lockup where all CPUs are stuck acquiring an
  already-locked spinlock (css_set_lock) within the cgroup subsystem.
  Below are the call stacks from a memory dump of a two-core system
  taken on Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs
  on Azure and GCP too.  It happens non-deterministically (less than
  1%) and can occur at any point during VM execution.
  We suspect it’s a deadlock triggered by some race condition, but we
  don’t know for sure.

  ```
  PID: 21079    TASK: ffff91fdcd1dc000  CPU: 0    COMMAND: "sh"
   #0 [fffffe7127850cb8] machine_kexec at ffffffffadc92680
   #1 [fffffe7127850d18] __crash_kexec at ffffffffadda0b9f
   #2 [fffffe7127850de0] panic at ffffffffae8f56be
   #3 [fffffe7127850e70] unknown_nmi_error.cold at ffffffffae8eb4c8
   #4 [fffffe7127850e90] default_do_nmi at ffffffffae99c639
   #5 [fffffe7127850eb8] exc_nmi at ffffffffae99c7db
   #6 [fffffe7127850ef0] end_repeat_nmi at ffffffffaea017f3
      [exception RIP: native_queued_spin_lock_slowpath+63]
      RIP: ffffffffadd40eff  RSP: ffffa1f68589fc60  RFLAGS: 00000002 (interrupt disabled!!)
      RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fb597c8980
      RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
      RBP: ffffa1f68589fc88   R8: 0000000000005259   R9: 00000000597c8980
      R10: 0000000000000000  R11: 0000000000000000  R12: ffffa1f68589fdf8
      R13: ffff91fdcd1d8000  R14: 0000000000004100  R15: ffff91fdcd1d8000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #7 [ffffa1f68589fc60] native_queued_spin_lock_slowpath at ffffffffadd40eff
   #8 [ffffa1f68589fc90] _raw_spin_lock_irq at ffffffffae9af19a
   #9 [ffffa1f68589fca0] cgroup_can_fork at ffffffffaddb0de8
  #10 [ffffa1f68589fce8] copy_process at ffffffffadcc1938
  #11 [ffffa1f68589fcf0] filemap_map_pages at ffffffffadeb68db
  #12 [ffffa1f68589fdf0] __x64_sys_vfork at ffffffffadcc2a20
  #13 [ffffa1f68589fe70] x64_sys_call at ffffffffadc068a9
  #14 [ffffa1f68589fe80] do_syscall_64 at ffffffffae99a9e4
  #15 [ffffa1f68589fec0] exit_to_user_mode_prepare at ffffffffadd725ad
  #16 [ffffa1f68589ff00] irqentry_exit_to_user_mode at ffffffffae99f43e
  #17 [ffffa1f68589ff10] irqentry_exit at ffffffffae99f46d
  #18 [ffffa1f68589ff18] clear_bhb_loop at ffffffffaea018c5
  #19 [ffffa1f68589ff28] clear_bhb_loop at ffffffffaea018c5
  #20 [ffffa1f68589ff38] clear_bhb_loop at ffffffffaea018c5
  #21 [ffffa1f68589ff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
      RIP: 00007fddfa4cebcc  RSP: 00007fffaa741990  RFLAGS: 00000202
      RAX: ffffffffffffffda  RBX: 000055ea66750428  RCX: 00007fddfa4cebcc
      RDX: 0000000000000000  RSI: 00007fffaa7419c0  RDI: 000055ea663c8866
      RBP: 0000000000000003   R8: 00007fffaa7419c0   R9: 000055ea667505f0
      R10: 0000000000000008  R11: 0000000000000202  R12: 00007fffaa7419c0
      R13: 00007fffaa741ae0  R14: 0000000000000000  R15: 000055ea663de810
      ORIG_RAX: 000000000000003a  CS: 0033  SS: 002b

  
  PID: 20304    TASK: ffff91fb05440000  CPU: 1    COMMAND: "Writer:Driver>C"
   #0 [fffffe6c293d3e10] crash_nmi_callback at ffffffffadc81ec0
   #1 [fffffe6c293d3e48] nmi_handle at ffffffffadc49b03
   #2 [fffffe6c293d3e90] default_do_nmi at ffffffffae99c5a5
   #3 [fffffe6c293d3eb8] exc_nmi at ffffffffae99c7db
   #4 [fffffe6c293d3ef0] end_repeat_nmi at ffffffffaea017f3
      [exception RIP: native_queued_spin_lock_slowpath+63]
      RIP: ffffffffadd40eff  RSP: ffffa1f6853afd00  RFLAGS: 00000002 (interrupt disabled!!)
      RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fa1d0aee00
      RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
      RBP: ffffa1f6853afd28   R8: 000000000000525a   R9: 000000001d0aee00
      R10: 0000000000000000  R11: 0000000000000000  R12: ffffa1f6853afe98
      R13: ffff91fd8eeea000  R14: 00000000003d0f00  R15: ffff91fd8eeea000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #5 [ffffa1f6853afd00] native_queued_spin_lock_slowpath at ffffffffadd40eff
   #6 [ffffa1f6853afd30] _raw_spin_lock_irq at ffffffffae9af19a
   #7 [ffffa1f6853afd40] cgroup_can_fork at ffffffffaddb0de8
   #8 [ffffa1f6853afd88] copy_process at ffffffffadcc1938
   #9 [ffffa1f6853afe20] kernel_clone at ffffffffadcc262d
  #10 [ffffa1f6853afe90] __do_sys_clone at ffffffffadcc2a9d
  #11 [ffffa1f6853aff10] __x64_sys_clone at ffffffffadcc2ae5
  #12 [ffffa1f6853aff20] x64_sys_call at ffffffffadc05579
  #13 [ffffa1f6853aff30] do_syscall_64 at ffffffffae99a9e4
  #14 [ffffa1f6853aff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
      RIP: 00007f0d8bcac9f6  RSP: 00007f0cfabfcc38  RFLAGS: 00000206
      RAX: ffffffffffffffda  RBX: 00007f0cfabfcc90  RCX: 00007f0d8bcac9f6
      RDX: 00007f0ced3ff910  RSI: 00007f0ced3feef0  RDI: 00000000003d0f00
      RBP: ffffffffffffff80   R8: 00007f0ced3ff640   R9: 00007f0ced3ff640
      R10: 00007f0ced3ff910  R11: 0000000000000206  R12: 00007f0ced3ff640
      R13: 0000000000000016  R14: 00007f0d8bc1b7d0  R15: 00007f0cfabfcdf0
      ORIG_RAX: 0000000000000038  CS: 0033  SS: 002b
  ```
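
One way to cross-check the claim that both CPUs are waiting on the same
lock is to compare the lock pointer in each NMI register dump. The sketch
below is illustrative only: `spinners_by_lock` and `SAMPLE` are
hypothetical names, the sample is an excerpt of the two backtraces above,
and it assumes RDI still holds the first argument (the lock pointer) of
native_queued_spin_lock_slowpath at the interrupted instruction. It does
not prove the address is css_set_lock; that comes from the report itself.

```python
import re
from collections import defaultdict

# Excerpt of the two crash backtraces quoted in the bug description.
SAMPLE = """\
PID: 21079    TASK: ffff91fdcd1dc000  CPU: 0    COMMAND: "sh"
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fb597c8980
    RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
PID: 20304    TASK: ffff91fb05440000  CPU: 1    COMMAND: "Writer:Driver>C"
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RAX: 0000000000000001  RBX: ffffffffb0ea5804  RCX: ffff91fa1d0aee00
    RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffffffffb0ea5804
"""


def spinners_by_lock(text):
    """Map each suspected lock address to the CPUs spinning on it.

    Assumption: RDI in the exception frame is the lock pointer passed to
    native_queued_spin_lock_slowpath (first argument on x86-64).
    """
    result = defaultdict(list)
    cpu = None
    in_spin = False
    for line in text.splitlines():
        m = re.search(r"\bCPU:\s*(\d+)", line)
        if m:
            # A new task header starts a new backtrace.
            cpu = int(m.group(1))
            in_spin = False
            continue
        if "exception RIP:" in line:
            in_spin = "native_queued_spin_lock_slowpath" in line
            continue
        m = re.search(r"\bRDI:\s*([0-9a-f]{16})", line)
        if m and in_spin and cpu is not None:
            result[m.group(1)].append(cpu)
            in_spin = False  # ignore the later user-space register dump
    return dict(result)


if __name__ == "__main__":
    for lock, cpus in spinners_by_lock(SAMPLE).items():
        print(f"lock {lock}: spinning CPUs {cpus}")
```

If every CPU in the dump maps to a single address, that is consistent
with all of them contending on one global lock whose holder never
released it.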

  Environment

  ```
  $ uname -a
  Linux ip-172-31-16-171 5.15.0-1072-aws #78~20.04.1-Ubuntu SMP Wed Oct 9 15:30:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  $ cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 106
  model name      : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
  stepping        : 6
  microcode       : 0xd0003e8
  cpu MHz         : 2900.036
  cache size      : 55296 KB
  physical id     : 0
  siblings        : 8
  core id         : 0
  cpu cores       : 4
  apicid          : 0
  initial apicid  : 0
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 27
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
  bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs mmio_stale_data eibrs_pbrsb gds bhi
  bogomips        : 5800.07
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 46 bits physical, 48 bits virtual
  ```

  We see this very infrequently, but have experienced it on a variety of
  instance types: r6i.large, r6i.xlarge, and r6i.2xlarge at least.

  Thanks!

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp