[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Hey Philip,

Thank you for the response. We think we've isolated an eBPF program we're running which might cause this interaction, I'll see on Monday if I can get you some more information to help debug.

> 1) Can you please run the command: apport-collect 2089318

Will aim to get you this on Monday when the team resumes investigation.

> 2) Is there anything I can do increase the likelihood of reproducing this?

I'll see if I can get you a better shape of the data here on Monday as well which could help with repro.

> 3) The bug title states you hit this on kernel version 5.15.0-1072-aws. Did you hit this on previous kernels, or is this a new regression that has appeared in the 5.15.0-1072-aws kernel?

We were definitely able to reproduce this as well on 5.15.0-1070-aws, and we think this has been a latent bug for a while which a recent deploy may have exposed.

> 4) You state that the same issue occurs on Azure, and GCP. Is that using the AWS kernel, or the Azure and GCP kernels (respectively)?

These are using cloud kernels respectively. Those are:

Azure - 5.15.0-1075-azure
GCP - 5.15.0-1071-gcp

Thanks again for taking a look - will aim to share more info on Monday. Happy Thanksgiving!

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2089318

Title:
  kernel hard lockup 5.15.0-1072-aws

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Philip Cox - I think we have an RCA. Below is the call stack of “iptables” at the moment of the hang (which is the same across all collected kernel dumps):

```
crash> bt 25894
PID: 25894  TASK: 89094bce8000  CPU: 1  COMMAND: "iptables"
 #0 [adb9456ab8f8] __schedule at a5ba8b8d
 #1 [adb9456ab980] preempt_schedule_common at a5ba92a8
 #2 [adb9456ab998] __cond_resched at a5ba92e6
 #3 [adb9456ab9a8] down_read at a5bab823
 #4 [adb9456ab9c0] kernfs_walk_and_get_ns at a5248b16
 #5 [adb9456ab9f8] cgroup_get_from_path at a4fa87fa
 #6 [adb9456aba20] cgroup_mt_check_v2 at c07bf083 [xt_cgroup]
 #7 [adb9456aba48] xt_check_match at c01304c1 [x_tables]
 #8 [adb9456abb08] find_check_entry at c014315e [ip_tables]
 #9 [adb9456abbc8] translate_table at c0144429 [ip_tables]
#10 [adb9456abc68] do_ipt_set_ctl at c014579c [ip_tables]
#11 [adb9456abd10] nf_setsockopt at a598d697
#12 [adb9456abd50] ip_setsockopt at a59a140a
#13 [adb9456abd90] raw_setsockopt at a59d44bf
#14 [adb9456abd98] security_socket_setsockopt at a533c5d2
#15 [adb9456abdc8] __sys_setsockopt at a58c1699
#16 [adb9456abe10] __x64_sys_setsockopt at a58c17c5
#17 [adb9456abe20] x64_sys_call at a4e06bab
#18 [adb9456abe30] do_syscall_64 at a5b9a9e4
#19 [adb9456abe88] handle_mm_fault at a51027d8
#20 [adb9456abec8] do_user_addr_fault at a4ea4b40
#21 [adb9456abf00] irqentry_exit_to_user_mode at a5b9f43e
#22 [adb9456abf10] irqentry_exit at a5b9f46d
#23 [adb9456abf18] clear_bhb_loop at a5c018c5
#24 [adb9456abf28] clear_bhb_loop at a5c018c5
#25 [adb9456abf38] clear_bhb_loop at a5c018c5
#26 [adb9456abf50] entry_SYSCALL_64_after_hwframe at a5c00124
    RIP: 7f715892496e  RSP: 7ffddb994cf8  RFLAGS: 0206
    RAX: ffda  RBX: 5589d9902dc8  RCX: 7f715892496e
    RDX: 0040  RSI:  RDI: 0004
    RBP: 5589d9909ec0  R8: 3348  R9: 0052
    R10: 5589d9909ec0  R11: 0206  R12: 5589d99097d0
    R13: 5589d9902dc8  R14: 5589d9902dc0  R15: 5589d9909f20
    ORIG_RAX: 0036  CS: 0033  SS: 002b
```

There are two cgroup-related functions on the stack, and the buggy one is cgroup_get_from_path: it acquires the spinlock and then calls a function which may cause the current process to sleep. This leaves the spinlock locked, triggering the subsequent hard lockup.

The good news is that the bug appears to have been present only briefly in the 5.15 kernel: it was first introduced in 5.15.75 and “fixed” in 5.16.1 (https://github.com/torvalds/linux/commit/46307fd6e27a3f678a1678b02e667678c22aa8cc).

So three follow-up questions for you at your convenience:

1. Does this RCA seem reasonable / correct to you?
2. If so, can Canonical backport this fix to the 5.15 and 5.0.4-fips kernels?
3. If so, in the meantime, is there a good way for me to find the version of the aws Ubuntu kernel which would not contain this issue? In other words, how can I translate 5.15.0-1072-aws to 5.15.xx so we can pin the kernel to the previous revision, if not too far back?
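On question 3, one possible approach is sketched below. It assumes the Ubuntu-specific file /proc/version_signature is present and that its last field is the upstream stable version the kernel is based on. The sample string in the usage lines is made up for illustration, not taken from a real 5.15.0-1072-aws system, and the 5.15.75 threshold is only the estimate from the analysis above. The function names are hypothetical. The check cannot tell whether a given Ubuntu build already carries a backported fix.

```python
import re
from typing import Optional, Tuple

# First upstream 5.15.y release believed to contain the bug, per the RCA
# in this thread. Treat it as an unconfirmed assumption.
FIRST_AFFECTED = (5, 15, 75)


def read_signature(path: str = "/proc/version_signature") -> str:
    """Read the Ubuntu version signature (the file exists on Ubuntu kernels)."""
    with open(path) as handle:
        return handle.read().strip()


def upstream_version(signature: str) -> Optional[Tuple[int, ...]]:
    """Return the upstream version from the last field, or None if unparsable."""
    fields = signature.split()
    if len(fields) < 3:
        return None
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", fields[-1])
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def possibly_affected(signature: str) -> Optional[bool]:
    """True if the kernel is a 5.15.y at or after FIRST_AFFECTED.

    This only compares version numbers. It does not know whether the
    distribution has backported a fix on top of that upstream version.
    """
    version = upstream_version(signature)
    if version is None:
        return None
    return version[:2] == (5, 15) and version >= FIRST_AFFECTED


# Hypothetical signature string, for illustration only.
sample = "Ubuntu 5.15.0-1072.78~20.04.1-aws 5.15.167"
print(upstream_version(sample), possibly_affected(sample))
```

On a live system, `possibly_affected(read_signature())` would apply the same check to the running kernel.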
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Thank you Philip!
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Hey friend - I hope you are well and had good holidays. Just checking in here to understand when we're likely to be able to pull the fix from Ubuntu mainline. Thanks in advance!
[Bug 2089318] Re: kernel hard lockup in cgroups during eBPF workload
Hey Philip, were we able to get it into the patch for 1/8? Is it still on track for release Feb 10? Thanks!
[Bug 2089318] Re: kernel hard lockup in cgroups during eBPF workload
No worries, thanks Philip.
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Totally understand, thank you again for your help Philip!
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Thank you Philip - we can only reproduce this in our own environment at high load - so I think it will be hard to reproduce in a small environment. I will test this today though and confirm the fix, thank you again for your help :D
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Amazing, thank you Philip! Looking forward to testing :)
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Also - the build seems to work well for x86, but I get the following for ARM:

```
root@ip-172-31-59-2:/home/ubuntu# apt list | grep
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
invesalius-bin/focal 3.1.2-3build2 arm64
invesalius-examples/focal 3.1.2-3build2 all
invesalius/focal 3.1.2-3build2 all
libfile-slurp-perl/focal .29-1 all
linux-aws-5.15-headers-5.15.0-/focal 5.15.0-.99~20.04.2 all
ruby-odbc/focal 0.8-1build2 arm64
sqitch/focal 0.-2 all
```

I see that the headers were published, but perhaps the rest of the binary was not. Thanks again for your help!
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Thank you sir. Yes, that understanding is correct: We're running Ubuntu 20.04 - 5.15 for Azure and GCP both. And just so I understand - once this is merged into the Ubuntu 5.15 branch, we'll likely be able to pick it up from the official ubuntu source in the new year?
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Thank you Philip - we're testing this now, and the build is looking promising so far. Would it be possible for us to get similar packages for Azure/GCP as well?
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Hey Philip, this is great thank you - is there any way we could get this for Focal? Or is there an easy way for me to install this kernel for focal from your PPA? When I list focal releases for your ppa I get the following:

```
> sudo apt --allow-unauthenticated update
...
Err:17 http://ppa.launchpad.net/philcox/lp2089318-kernel-hard-lockup/ubuntu focal Release
  404 Not Found [IP: 185.125.190.80 80]
...
```

Thanks again for the work making this available!
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Philip - we have no reports of kernel hangs in our staging environment since the deployment. I think we can consider the patch to have fixed the issue. Thank you again for your hard work getting this patched! Have a lovely holiday season.
[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws
Thanks for your patience on this testing. Just confirming here that Azure, GCP, AWS all built correctly with your change, we deployed the fix this afternoon and are monitoring for any issues - will report results tomorrow morning.
[Bug 2089318] [NEW] kernel hard lockup 5.15.0-1072-aws
Public bug reported:

Hi friends,

We hit a kernel hard lockup where all CPUs are stuck acquiring an already-locked spinlock (css_set_lock) within the cgroup subsystem. Below are the call stacks from a memory dump of a two-core system taken on Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs on Azure and GCP too. This is happening in a non-deterministic fashion (less than 1%), and can occur at any time of the VM execution. We suspect it’s a deadlock triggered by some race condition, but we don’t know for sure.

```
PID: 21079  TASK: 91fdcd1dc000  CPU: 0  COMMAND: "sh"
 #0 [fe7127850cb8] machine_kexec at adc92680
 #1 [fe7127850d18] __crash_kexec at adda0b9f
 #2 [fe7127850de0] panic at ae8f56be
 #3 [fe7127850e70] unknown_nmi_error.cold at ae8eb4c8
 #4 [fe7127850e90] default_do_nmi at ae99c639
 #5 [fe7127850eb8] exc_nmi at ae99c7db
 #6 [fe7127850ef0] end_repeat_nmi at aea017f3
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: add40eff  RSP: a1f68589fc60  RFLAGS: 0002 (interrupt disabled!!)
    RAX: 0001  RBX: b0ea5804  RCX: 91fb597c8980
    RDX: 0001  RSI: 0001  RDI: b0ea5804
    RBP: a1f68589fc88  R8: 5259  R9: 597c8980
    R10:  R11:  R12: a1f68589fdf8
    R13: 91fdcd1d8000  R14: 4100  R15: 91fdcd1d8000
    ORIG_RAX:  CS: 0010  SS: 0018
--- ---
 #7 [a1f68589fc60] native_queued_spin_lock_slowpath at add40eff
 #8 [a1f68589fc90] _raw_spin_lock_irq at ae9af19a
 #9 [a1f68589fca0] cgroup_can_fork at addb0de8
#10 [a1f68589fce8] copy_process at adcc1938
#11 [a1f68589fcf0] filemap_map_pages at adeb68db
#12 [a1f68589fdf0] __x64_sys_vfork at adcc2a20
#13 [a1f68589fe70] x64_sys_call at adc068a9
#14 [a1f68589fe80] do_syscall_64 at ae99a9e4
#15 [a1f68589fec0] exit_to_user_mode_prepare at add725ad
#16 [a1f68589ff00] irqentry_exit_to_user_mode at ae99f43e
#17 [a1f68589ff10] irqentry_exit at ae99f46d
#18 [a1f68589ff18] clear_bhb_loop at aea018c5
#19 [a1f68589ff28] clear_bhb_loop at aea018c5
#20 [a1f68589ff38] clear_bhb_loop at aea018c5
#21 [a1f68589ff50] entry_SYSCALL_64_after_hwframe at aea00124
    RIP: 7fddfa4cebcc  RSP: 7fffaa741990  RFLAGS: 0202
    RAX: ffda  RBX: 55ea66750428  RCX: 7fddfa4cebcc
    RDX:  RSI: 7fffaa7419c0  RDI: 55ea663c8866
    RBP: 0003  R8: 7fffaa7419c0  R9: 55ea667505f0
    R10: 0008  R11: 0202  R12: 7fffaa7419c0
    R13: 7fffaa741ae0  R14:  R15: 55ea663de810
    ORIG_RAX: 003a  CS: 0033  SS: 002b

PID: 20304  TASK: 91fb0544  CPU: 1  COMMAND: "Writer:Driver>C"
 #0 [fe6c293d3e10] crash_nmi_callback at adc81ec0
 #1 [fe6c293d3e48] nmi_handle at adc49b03
 #2 [fe6c293d3e90] default_do_nmi at ae99c5a5
 #3 [fe6c293d3eb8] exc_nmi at ae99c7db
 #4 [fe6c293d3ef0] end_repeat_nmi at aea017f3
    [exception RIP: native_queued_spin_lock_slowpath+63]
    RIP: add40eff  RSP: a1f6853afd00  RFLAGS: 0002 (interrupt disabled!!)
    RAX: 0001  RBX: b0ea5804  RCX: 91fa1d0aee00
    RDX: 0001  RSI: 0001  RDI: b0ea5804
    RBP: a1f6853afd28  R8: 525a  R9: 1d0aee00
    R10:  R11:  R12: a1f6853afe98
    R13: 91fd8eeea000  R14: 003d0f00  R15: 91fd8eeea000
    ORIG_RAX:  CS: 0010  SS: 0018
--- ---
 #5 [a1f6853afd00] native_queued_spin_lock_slowpath at add40eff
 #6 [a1f6853afd30] _raw_spin_lock_irq at ae9af19a
 #7 [a1f6853afd40] cgroup_can_fork at addb0de8
 #8 [a1f6853afd88] copy_process at adcc1938
 #9 [a1f6853afe20] kernel_clone at adcc262d
#10 [a1f6853afe90] __do_sys_clone at adcc2a9d
#11 [a1f6853aff10] __x64_sys_clone at adcc2ae5
#12 [a1f6853aff20] x64_sys_call at adc05579
#13 [a1f6853aff30] do_syscall_64 at ae99a9e4
#14 [a1f6853aff50] entry_SYSCALL_64_after_hwframe at aea00124
    RIP: 7f0d8bcac9f6  RSP: 7f0cfabfcc38  RFLAGS: 0206
    RAX: ffda  RBX: 7f0cfabfcc90  RCX: 7f0d8bcac9f6
    RDX: 7f0ced3ff910  RSI: 7f0ced3feef0  RDI: 003d0f00
    RBP: ff80  R8: 7f0ced3ff640  R9: 7f0ced3ff640
    R10: 7f0ced3ff910  R11: 0206  R12: 7f0ced3ff640
    R13: 0016  R14: 7f0d8bc1b7d0  R15: 7f0cfabfcdf0
    ORIG_RAX: 0038  CS: 0033  SS: 002b
```

Enviro