Thank you all for your ideas! Yes, we do have some modules that are not from the kernel source tree: Mellanox (our NICs) and Open vSwitch. We use them because we had some problems that were fixed in the newer driver versions.
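For anyone who wants to confirm which loaded modules count as out-of-tree on a host, here is a minimal sketch. It assumes the usual `/proc/modules` layout, where a tainted module carries a trailing flags field such as `(OE)` and the `O` flag marks an out-of-tree module. The function names are only illustrative, not part of any existing tool.

```python
import re

# Optional trailing taint field of a /proc/modules line, e.g. "(OE)".
# Assumption: flags are uppercase letters, possibly with '+' or '-'
# while a module is loading or unloading.
_TAINT_RE = re.compile(r"\(([A-Z+-]+)\)\s*$")


def out_of_tree_modules(proc_modules_text):
    """Return the names of modules whose taint flags include 'O'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = _TAINT_RE.search(line)
        if match and "O" in match.group(1):
            names.append(fields[0])  # first field is the module name
    return names


def read_proc_modules(path="/proc/modules"):
    """Read the module list from the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a hypervisor node, `out_of_tree_modules(read_proc_modules())` would be expected to list the Mellanox and Open vSwitch modules mentioned above, assuming they were built outside the kernel tree.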
We don't have apport enabled, and in fact the hypervisor nodes don't even have direct access to the internet (only some of the VMs on them do). I checked on a test VM what kind of information apport collects, and it seems to be the architecture, the kernel version, and the stack trace. That information is already attached manually; it was collected by netconsole, which we have enabled.

When the issue started, it was reproducible even on the then-latest kernel (5.4.0-66), so I'm not sure that simply upgrading will help. Currently I'm working on integrating kdump into our infrastructure and trying to reproduce the issue again. I'll also try to schedule a migration and upgrade for our hypervisor node (that won't be fast, though).

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1921355

Title: cgroups related kernel panics

Status in linux package in Ubuntu: Incomplete
Status in linux-hwe-5.4 package in Ubuntu: Confirmed

Bug description:

Hi!

Over the last 6 months we've upgraded our hypervisor compute hosts from the Ubuntu Bionic 4.15.* kernel to the Ubuntu Bionic HWE 5.4 kernel.

This month we noticed that several nodes failed due to bugs in cgroups. The trace was different almost every time, but it always revolves around cgroups: either a null pointer dereference or a panic caught by the BUG_ON() macro. It looked as if some cgroup no longer existed but something still tried to access it, causing a kernel panic.

Please find the logs attached.

3 of the 4 cases happened after a VM shutdown. We tried to spawn lots of VMs, load them, and shut them down, but didn't manage to reproduce the behavior. In fact, every case is somewhat different: the kernel patch versions differ (5.4.0-42 to 5.4.0-66) and the uptimes vary (from 1 day to about half a year). There are also lots of hosts with several months of uptime that show no issue. Also, we never saw this behavior on 4.15 at all.
That's quite disturbing, as I don't want dozens of VMs to crash (due to a host outage) at random times for some vague reason...

I didn't manage to find any related bugs on the bug tracker, so I'm creating this one. I wonder if anybody in the community has come across something like this. Could somebody give advice on how to debug this further, or where else to report it or look for a similar case?

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1921355/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp