[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-11-28 Thread Max Wolffe
Hey Philip,

Thank you for the response. We think we've isolated an eBPF program
we're running which might cause this interaction, I'll see on Monday if
I can get you some more information to help debug.

> 1)  Can you please run the command:
  apport-collect 2089318

Will aim to get you this on Monday when the team resumes investigation.

> 2) Is there anything I can do to increase the likelihood of reproducing
this?

I'll see if I can get you a better shape of the data here on Monday as
well which could help with repro.

> 3) The bug title states you hit this on kernel version
5.15.0-1072-aws. Did you hit this on previous kernels, or is this a new
regression that has appeared in the 5.15.0-1072-aws kernel?

We were definitely able to reproduce this as well on 5.15.0-1070-aws,
and we think this has been a latent bug for a while which a recent
deploy may have exposed.

> 4) You state that the same issue occurs on Azure, and GCP. Is that
using the AWS kernel, or the Azure and GCP kernels (respectively)?

These are using the respective cloud kernels:

Azure - 5.15.0-1075-azure
GCP - 5.15.0-1071-gcp


Thanks again for taking a look - will aim to share more info on Monday.
Happy Thanksgiving!

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2089318

Title:
  kernel hard lockup 5.15.0-1072-aws

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-02 Thread Max Wolffe
Philip Cox - I think we have an RCA.

Below is the call stack of “iptables” at the moment of the hang (which is the
same across all collected kernel dumps):
```
crash> bt 25894
PID: 25894  TASK: 89094bce8000  CPU: 1  COMMAND: "iptables"
 #0 [adb9456ab8f8] __schedule at a5ba8b8d
 #1 [adb9456ab980] preempt_schedule_common at a5ba92a8
 #2 [adb9456ab998] __cond_resched at a5ba92e6
 #3 [adb9456ab9a8] down_read at a5bab823
 #4 [adb9456ab9c0] kernfs_walk_and_get_ns at a5248b16
 #5 [adb9456ab9f8] cgroup_get_from_path at a4fa87fa
 #6 [adb9456aba20] cgroup_mt_check_v2 at c07bf083 [xt_cgroup]
 #7 [adb9456aba48] xt_check_match at c01304c1 [x_tables]
 #8 [adb9456abb08] find_check_entry at c014315e [ip_tables]
 #9 [adb9456abbc8] translate_table at c0144429 [ip_tables]
#10 [adb9456abc68] do_ipt_set_ctl at c014579c [ip_tables]
#11 [adb9456abd10] nf_setsockopt at a598d697
#12 [adb9456abd50] ip_setsockopt at a59a140a
#13 [adb9456abd90] raw_setsockopt at a59d44bf
#14 [adb9456abd98] security_socket_setsockopt at a533c5d2
#15 [adb9456abdc8] __sys_setsockopt at a58c1699
#16 [adb9456abe10] __x64_sys_setsockopt at a58c17c5
#17 [adb9456abe20] x64_sys_call at a4e06bab
#18 [adb9456abe30] do_syscall_64 at a5b9a9e4
#19 [adb9456abe88] handle_mm_fault at a51027d8
#20 [adb9456abec8] do_user_addr_fault at a4ea4b40
#21 [adb9456abf00] irqentry_exit_to_user_mode at a5b9f43e
#22 [adb9456abf10] irqentry_exit at a5b9f46d
#23 [adb9456abf18] clear_bhb_loop at a5c018c5
#24 [adb9456abf28] clear_bhb_loop at a5c018c5
#25 [adb9456abf38] clear_bhb_loop at a5c018c5
#26 [adb9456abf50] entry_SYSCALL_64_after_hwframe at a5c00124
RIP: 7f715892496e  RSP: 7ffddb994cf8  RFLAGS: 0206
RAX: ffda  RBX: 5589d9902dc8  RCX: 7f715892496e
RDX: 0040  RSI:   RDI: 0004
RBP: 5589d9909ec0   R8: 3348   R9: 0052
R10: 5589d9909ec0  R11: 0206  R12: 5589d99097d0
R13: 5589d9902dc8  R14: 5589d9902dc0  R15: 5589d9909f20
ORIG_RAX: 0036  CS: 0033  SS: 002b
```

There are two cgroup-related functions on the stack, and the buggy one
is cgroup_get_from_path: it acquires the spinlock and then calls a
function which may cause the current process to sleep (in this stack,
kernfs_walk_and_get_ns, which blocks in down_read).  This leaves the
spinlock locked, triggering the subsequent hard lockup.
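
As a rough way to check this reading against other dumps, the sketch
below parses `bt` frame lines of the form shown above and reports whether
the task was scheduled out (`__schedule`) at a point deeper in the call
chain than a given function. The helper names (`parse_frames`,
`slept_inside`) and the regular expression are illustrative assumptions,
not part of crash. It only inspects text and does not prove which lock
was held.

```python
import re

# Matches crash "bt" frame lines such as:
#   " #3 [adb9456ab9a8] down_read at a5bab823"
FRAME_RE = re.compile(r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def parse_frames(bt_text):
    """Return [(frame_number, function_name), ...] in the order printed."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2)))
    return frames


def slept_inside(bt_text, func):
    """True if __schedule appears in a more recent frame than `func`.

    crash prints the most recent frame as #0, so a lower frame number
    means the call happened deeper in the chain.
    """
    numbers = {name: num for num, name in parse_frames(bt_text)}
    if func not in numbers or "__schedule" not in numbers:
        return False
    return numbers["__schedule"] < numbers[func]


# First frames of the iptables backtrace quoted above.
SAMPLE = """\
 #0 [adb9456ab8f8] __schedule at a5ba8b8d
 #1 [adb9456ab980] preempt_schedule_common at a5ba92a8
 #2 [adb9456ab998] __cond_resched at a5ba92e6
 #3 [adb9456ab9a8] down_read at a5bab823
 #4 [adb9456ab9c0] kernfs_walk_and_get_ns at a5248b16
 #5 [adb9456ab9f8] cgroup_get_from_path at a4fa87fa
 #6 [adb9456aba20] cgroup_mt_check_v2 at c07bf083 [xt_cgroup]
"""

print(slept_inside(SAMPLE, "cgroup_get_from_path"))
```

Running the same check over each collected dump would be one way to
confirm that the pattern is the same everywhere.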

The good news is that the bug appears to affect only a limited range of
kernels: it was first introduced in 5.15.75 and “fixed” in 5.16.1
(https://github.com/torvalds/linux/commit/46307fd6e27a3f678a1678b02e667678c22aa8cc).

So three follow-up questions for you at your convenience:

1. Does this RCA seem reasonable / correct to you?
2. If so, can Canonical backport this fix to the 5.15 and 5.0.4-fips kernels?
3. If so, in the meantime, is there a good way for me to find the version of
the aws Ubuntu kernel which would not contain this issue? In other words, how
can I translate 5.15.0-1072-aws to 5.15.xx so we can pin the kernel to the
previous revision, if it is not too far back?
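
On the last question, Ubuntu kernels normally expose the upstream stable
version they are based on as the third field of
`/proc/version_signature`. The following is a minimal sketch, assuming
that file is present and has the usual three-field layout. The sample
string and the helper names are hypothetical, and the 5.15.75 threshold
is simply the version quoted in this message.

```python
from pathlib import Path

# Ubuntu-specific file; read it with SIGNATURE_PATH.read_text() on a host.
SIGNATURE_PATH = Path("/proc/version_signature")

# Hypothetical sample value, for illustration only.
SAMPLE_SIGNATURE = "Ubuntu 5.15.0-1072.78~20.04.1-aws 5.15.167"


def upstream_version(signature):
    """Parse the upstream version from a version_signature string.

    Expected layout: "<distro> <package version> <upstream version>".
    """
    fields = signature.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected signature: {signature!r}")
    return tuple(int(part) for part in fields[2].split("."))


def in_suspect_range(version, first_bad=(5, 15, 75)):
    """True for 5.15.y versions at or after the quoted introduction point."""
    return version[:2] == first_bad[:2] and version >= first_bad


ver = upstream_version(SAMPLE_SIGNATURE)
print(".".join(map(str, ver)), "suspect:", in_suspect_range(ver))
```

Ubuntu kernels carry additional patches on top of the upstream base, so
the result is only a first approximation of whether a given build
contains the change.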


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-09 Thread Max Wolffe
Thank you Philip!


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2025-01-06 Thread Max Wolffe
Hey friend - I hope you are well and had good holidays. Just checking in
here to understand when we're likely to be able to pull the fix from
Ubuntu mainline. Thanks in advance!


[Bug 2089318] Re: kernel hard lockup in cgroups during eBPF workload

2025-02-04 Thread Max Wolffe
Hey Philip, were we able to get it into the patch for 1/8? Is it still
on track for release Feb 10?

Thanks!


[Bug 2089318] Re: kernel hard lockup in cgroups during eBPF workload

2025-02-05 Thread Max Wolffe
No worries, thanks Philip.


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-16 Thread Max Wolffe
Totally understand, thank you again for your help Philip!


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-17 Thread Max Wolffe
Thank you Philip - we can only reproduce this in our own environment at
high load - so I think it will be hard to reproduce in a small
environment. I will test this today though and confirm the fix, thank
you again for your help :D


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-12 Thread Max Wolffe
Amazing, thank you Philip! Looking forward to testing :)


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-13 Thread Max Wolffe
Also - the build seems to work well for x86, but I get the following for
ARM:

```
root@ip-172-31-59-2:/home/ubuntu# apt list | grep 

WARNING: apt does not have a stable CLI interface. Use with caution in
scripts.

invesalius-bin/focal 3.1.2-3build2 arm64
invesalius-examples/focal 3.1.2-3build2 all
invesalius/focal 3.1.2-3build2 all
libfile-slurp-perl/focal .29-1 all
linux-aws-5.15-headers-5.15.0-/focal 5.15.0-.99~20.04.2 all
ruby-odbc/focal 0.8-1build2 arm64
sqitch/focal 0.-2 all
```

I see that the headers were published, but perhaps the rest of the
binary was not.

Thanks again for your help!


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-13 Thread Max Wolffe
Thank you sir. Yes, that understanding is correct:

We're running Ubuntu 20.04 - 5.15 for Azure and GCP both.

And just so I understand - once this is merged into the Ubuntu 5.15
branch, we'll likely be able to pick it up from the official ubuntu
source in the new year?


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-13 Thread Max Wolffe
Thank you Philip - we're testing this now, but the build is looking
promising so far.

Would it be possible for us to get similar packages in azure/gcp as
well?


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-12 Thread Max Wolffe
Hey Philip, this is great, thank you - is there any way we could get this
for Focal?

Or is there an easy way for me to install this kernel for focal from
your PPA?

When I list focal releases for your ppa I get the following:
```
> sudo apt --allow-unauthenticated update
...
Err:17 http://ppa.launchpad.net/philcox/lp2089318-kernel-hard-lockup/ubuntu focal Release
  404  Not Found [IP: 185.125.190.80 80]
...
```

Thanks again for the work making this available!


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-20 Thread Max Wolffe
Philip - we have no reports of kernel hangs in our staging environment
since the deployment. I think we can consider the patch to have fixed
the issue.

Thank you again for your hard work getting this patched! Have a lovely
holiday season.


[Bug 2089318] Re: kernel hard lockup 5.15.0-1072-aws

2024-12-19 Thread Max Wolffe
Thanks for your patience on this testing. Just confirming here that
Azure, GCP, and AWS all built correctly with your change. We deployed the
fix this afternoon and are monitoring for any issues - will report
results tomorrow morning.


[Bug 2089318] [NEW] kernel hard lockup 5.15.0-1072-aws

2024-11-21 Thread Max Wolffe
Public bug reported:

Hi friends,

We hit a kernel hard lockup where all CPUs are stuck acquiring an
already-locked spinlock (css_set_lock) within the cgroup subsystem.
Below are the call stacks from a memory dump of a two-core system taken
on Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs on Azure
and GCP too.  This is happening in a non-deterministic fashion (less
than 1%), and can occur at any time of the VM execution.  We suspect
it’s a deadlock triggered by some race condition, but we don’t know for
sure.

```
PID: 21079  TASK: 91fdcd1dc000  CPU: 0  COMMAND: "sh"
 #0 [fe7127850cb8] machine_kexec at adc92680
 #1 [fe7127850d18] __crash_kexec at adda0b9f
 #2 [fe7127850de0] panic at ae8f56be
 #3 [fe7127850e70] unknown_nmi_error.cold at ae8eb4c8
 #4 [fe7127850e90] default_do_nmi at ae99c639
 #5 [fe7127850eb8] exc_nmi at ae99c7db
 #6 [fe7127850ef0] end_repeat_nmi at aea017f3
[exception RIP: native_queued_spin_lock_slowpath+63]
RIP: add40eff  RSP: a1f68589fc60  RFLAGS: 0002 (interrupt disabled!!)
RAX: 0001  RBX: b0ea5804  RCX: 91fb597c8980
RDX: 0001  RSI: 0001  RDI: b0ea5804
RBP: a1f68589fc88   R8: 5259   R9: 597c8980
R10:   R11:   R12: a1f68589fdf8
R13: 91fdcd1d8000  R14: 4100  R15: 91fdcd1d8000
ORIG_RAX:   CS: 0010  SS: 0018
---  ---
 #7 [a1f68589fc60] native_queued_spin_lock_slowpath at add40eff
 #8 [a1f68589fc90] _raw_spin_lock_irq at ae9af19a
 #9 [a1f68589fca0] cgroup_can_fork at addb0de8
#10 [a1f68589fce8] copy_process at adcc1938
#11 [a1f68589fcf0] filemap_map_pages at adeb68db
#12 [a1f68589fdf0] __x64_sys_vfork at adcc2a20
#13 [a1f68589fe70] x64_sys_call at adc068a9
#14 [a1f68589fe80] do_syscall_64 at ae99a9e4
#15 [a1f68589fec0] exit_to_user_mode_prepare at add725ad
#16 [a1f68589ff00] irqentry_exit_to_user_mode at ae99f43e
#17 [a1f68589ff10] irqentry_exit at ae99f46d
#18 [a1f68589ff18] clear_bhb_loop at aea018c5
#19 [a1f68589ff28] clear_bhb_loop at aea018c5
#20 [a1f68589ff38] clear_bhb_loop at aea018c5
#21 [a1f68589ff50] entry_SYSCALL_64_after_hwframe at aea00124
RIP: 7fddfa4cebcc  RSP: 7fffaa741990  RFLAGS: 0202
RAX: ffda  RBX: 55ea66750428  RCX: 7fddfa4cebcc
RDX:   RSI: 7fffaa7419c0  RDI: 55ea663c8866
RBP: 0003   R8: 7fffaa7419c0   R9: 55ea667505f0
R10: 0008  R11: 0202  R12: 7fffaa7419c0
R13: 7fffaa741ae0  R14:   R15: 55ea663de810
ORIG_RAX: 003a  CS: 0033  SS: 002b


PID: 20304  TASK: 91fb0544  CPU: 1  COMMAND: "Writer:Driver>C"
 #0 [fe6c293d3e10] crash_nmi_callback at adc81ec0
 #1 [fe6c293d3e48] nmi_handle at adc49b03
 #2 [fe6c293d3e90] default_do_nmi at ae99c5a5
 #3 [fe6c293d3eb8] exc_nmi at ae99c7db
 #4 [fe6c293d3ef0] end_repeat_nmi at aea017f3
[exception RIP: native_queued_spin_lock_slowpath+63]
RIP: add40eff  RSP: a1f6853afd00  RFLAGS: 0002 (interrupt disabled!!)
RAX: 0001  RBX: b0ea5804  RCX: 91fa1d0aee00
RDX: 0001  RSI: 0001  RDI: b0ea5804
RBP: a1f6853afd28   R8: 525a   R9: 1d0aee00
R10:   R11:   R12: a1f6853afe98
R13: 91fd8eeea000  R14: 003d0f00  R15: 91fd8eeea000
ORIG_RAX:   CS: 0010  SS: 0018
---  ---
 #5 [a1f6853afd00] native_queued_spin_lock_slowpath at add40eff
 #6 [a1f6853afd30] _raw_spin_lock_irq at ae9af19a
 #7 [a1f6853afd40] cgroup_can_fork at addb0de8
 #8 [a1f6853afd88] copy_process at adcc1938
 #9 [a1f6853afe20] kernel_clone at adcc262d
#10 [a1f6853afe90] __do_sys_clone at adcc2a9d
#11 [a1f6853aff10] __x64_sys_clone at adcc2ae5
#12 [a1f6853aff20] x64_sys_call at adc05579
#13 [a1f6853aff30] do_syscall_64 at ae99a9e4
#14 [a1f6853aff50] entry_SYSCALL_64_after_hwframe at aea00124
RIP: 7f0d8bcac9f6  RSP: 7f0cfabfcc38  RFLAGS: 0206
RAX: ffda  RBX: 7f0cfabfcc90  RCX: 7f0d8bcac9f6
RDX: 7f0ced3ff910  RSI: 7f0ced3feef0  RDI: 003d0f00
RBP: ff80   R8: 7f0ced3ff640   R9: 7f0ced3ff640
R10: 7f0ced3ff910  R11: 0206  R12: 7f0ced3ff640
R13: 0016  R14: 7f0d8bc1b7d0  R15: 7f0cfabfcdf0
ORIG_RAX: 0038  CS: 0033  SS: 002b
```

Enviro