Thank you all for your ideas!

Yes, we do run some out-of-tree modules: the Mellanox drivers (for our
NICs) and Open vSwitch. We use them because we had some problems that
were fixed in the newer driver versions.
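
For anyone who wants to check the same thing on their own hosts, here is
a minimal Python sketch of how out-of-tree modules could be listed. It
assumes the /proc/modules format, where loaded modules may carry a
trailing taint field such as "(OE)" and the "O" flag marks an
out-of-tree module. The function name and the sample lines are
hypothetical and are not taken from our hosts.

```python
import re

# Matches the optional trailing taint field, e.g. "(OE)", on a /proc/modules line.
_TAINT_RE = re.compile(r"\(([A-Z+]+)\)\s*$")


def out_of_tree_modules(proc_modules_text):
    """Return names of modules whose taint flags include 'O' (out-of-tree)."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = _TAINT_RE.search(line)
        if match and "O" in match.group(1):
            names.append(fields[0])  # first field is the module name
    return names


# Hypothetical sample in /proc/modules format (illustration only).
SAMPLE = """\
mlx5_core 1458176 1 mlx5_ib, Live 0xffffffffc0a00000 (OE)
openvswitch 167936 12 - Live 0xffffffffc0900000 (OE)
kvm_intel 282624 40 - Live 0xffffffffc0800000
"""

print(out_of_tree_modules(SAMPLE))
```

On a real host the same function can be fed the contents of
/proc/modules instead of the sample text.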

We don't have apport enabled, and in fact the hypervisor nodes don't
even have direct access to the internet (only some of the VMs on them
do). I checked on a test VM what kind of information apport collects,
and it seems to be the architecture, the kernel version, and the stack
trace. That information is already attached manually: we have
netconsole enabled, which collected it.

When the issue started, it was reproducible even on the then-latest
kernel (5.4.0-66), so I'm not sure that simply upgrading will help.

Currently I'm working on integrating kdump into our infrastructure and
trying to reproduce the issue again. I'll also try to schedule a
migration and upgrade for our hypervisor node (that won't be fast,
though).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1921355

Title:
  cgroups related kernel panics

Status in linux package in Ubuntu:
  Incomplete
Status in linux-hwe-5.4 package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Over the last 6 months we've upgraded our hypervisor compute hosts
  from the Ubuntu Bionic 4.15.* kernel to the Ubuntu Bionic HWE 5.4
  kernel.

  This month we noticed that several nodes failed due to bugs in cgroups.
  The trace was different almost every time, but they all revolve around
  cgroups: either NULL pointer failures or a panic triggered by the
  BUG_ON() macro. It looked like some cgroup no longer existed but
  something still tried to access it, causing a kernel panic.
  Please find the logs attached.

  3 of 4 cases happened after a VM shutdown. We tried spawning lots of
  VMs, loading them, and shutting them down, but didn't manage to
  reproduce the behavior.
  In fact, every case is somewhat different: the kernel patch versions
  vary (5.4.0-42 to 5.4.0-66), and so does the uptime (from 1 day to
  about half a year). There are also lots of hosts with several months
  of uptime and no issues. On 4.15 we never saw this behavior at all.
  That's quite disturbing, as I don't want dozens of VMs to crash (due
  to a host outage) at random times for some unclear reason.
  I didn't manage to find any related bugs on the bug tracker, so I'm
  creating this one.

  I wonder if anybody in the community has come across something like
  this. Could somebody advise how to debug this further, or where else
  to report it or look for a similar case?

