Narrowed this down to zram rather than an issue with mm or I/O. It turns out that a 3-hour soak test on each bisect step is the only reliable way to bisect this. After bisecting between 4.20 and 5.0 I finally pinned down the issue, and hence the commits required to fix it.
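The soak-per-step approach described above could be scripted roughly as follows. This is only a sketch under assumptions that are not in the comment: the test VM is reachable over ssh under the hypothetical name "guest", a run that has to be killed by timeout(1) is treated as a hang, and the function name soak_verdict is made up for illustration.

```shell
#!/bin/sh
# Classify one soak run for a bisect step.
# Usage: soak_verdict LIMIT COMMAND [ARGS...]
# LIMIT is a duration understood by coreutils timeout(1), e.g. 200m.
soak_verdict() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo survived ;;      # command ran to completion
        124) echo hung ;;          # timeout(1) had to kill the command
        *)   echo inconclusive ;;  # e.g. ssh could not connect at all
    esac
}

# Hypothetical use from a host watching the test VM (names are assumptions):
#   soak_verdict 200m ssh guest 'stress-ng --brk 0 --stack 0 -t 3h'
```

The limit is deliberately longer than the 3-hour stress-ng run, so a healthy kernel returns on its own and only a stalled session is killed. The verdict still has to be mapped by hand onto whatever terms the bisect uses.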
** Description changed:

+ == SRU Justification ==
+ 
+ When using zram (as installed and configured with the zram-config package)
+ systems can lock up after about a week of use. This occurs because of
+ a hang in a lock in zram.
+ 
+ == Test Case ==
+ 
+ Run stress-ng --brk 0 --stack 0 in a Bionic amd64 server VM with 1GB of
+ memory, 16 CPU threads and zram-config installed. Without the fix the
+ kernel will hang in a spinlock after 1-2 hours of run time. With the fix,
+ the hang does not occur. Testing shows that with the fix, 5 x 16 CPU hours
+ of stress testing with stress-ng works fine without the lockup occurring.
+ 
+ == The fix ==
+ 
+ Upstream commit c4d6c4cc7bfd ("zram: correct flag name of ZRAM_ACCESS") as
+ a prerequisite followed by a minor context wiggle backport of the fix with
+ commit 3c9959e02547 ("zram: fix lockdep warning of free block handling").
+ 
+ == Regression Potential ==
+ 
+ This touches the zram locking, so the core zram driver is affected. However,
+ the fixes are backports from 5.0, so the fixes have had a fair amount of
+ testing in later kernels.
+ 

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1799497

Title:
  4.15 kernel hard lockup about once a week

Status in linux package in Ubuntu:
  Incomplete
Status in zram-config package in Ubuntu:
  Incomplete
Status in linux source package in Bionic:
  Confirmed
Status in zram-config source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  When using zram (as installed and configured with the zram-config
  package) systems can lock up after about a week of use. This occurs
  because of a hang in a lock in zram.

  == Test Case ==

  Run stress-ng --brk 0 --stack 0 in a Bionic amd64 server VM with 1GB
  of memory, 16 CPU threads and zram-config installed. Without the fix
  the kernel will hang in a spinlock after 1-2 hours of run time. With
  the fix, the hang does not occur. Testing shows that with the fix,
  5 x 16 CPU hours of stress testing with stress-ng works fine without
  the lockup occurring.

  == The fix ==

  Upstream commit c4d6c4cc7bfd ("zram: correct flag name of
  ZRAM_ACCESS") as a prerequisite followed by a minor context wiggle
  backport of the fix with commit 3c9959e02547 ("zram: fix lockdep
  warning of free block handling").

  == Regression Potential ==

  This touches the zram locking, so the core zram driver is affected.
  However, the fixes are backports from 5.0, so the fixes have had a
  fair amount of testing in later kernels.

  My main server has been running into hard lockups about once a week
  ever since I switched to the 4.15 Ubuntu 18.04 kernel.
  When this happens, nothing is printed to the console; it's effectively
  stuck showing a login prompt. The system is running with panic=1 on
  the cmdline but isn't rebooting, so the kernel isn't even processing
  this as a kernel panic.

  As this felt like a potential hardware issue, I had my hosting provider
  give me a completely different system: different motherboard, different
  CPU, different RAM and different storage. I installed 18.04 on that
  system and moved my data over; a week later, I hit the issue again.

  We've since also had an LXD user report similar symptoms, again on
  varying hardware:
    https://github.com/lxc/lxd/issues/5197

  My system doesn't have a lot of memory pressure, with about 50% of
  memory free:

  root@vorash:~# free -m
                total        used        free      shared  buff/cache   available
  Mem:          31819       17574         402         513       13842       13292
  Swap:         15909        2687       13222

  I will now try to increase console logging as much as possible on the
  system in the hope that next time it hangs we can get a better idea of
  what happened, but I'm not too hopeful given the complete silence on
  the console when this occurs.

  System is currently on:
    Linux vorash 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  But I've seen this since the GA kernel on 4.15, so it's not a recent
  regression.
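Since the test case in the SRU justification depends on zram-config being installed, it may be worth confirming that swap is really backed by zram before starting a run. A minimal sketch, assuming the usual /proc/swaps layout (one header line, device name in the first column); the function name has_zram_swap is hypothetical and not part of any package.

```shell
#!/bin/sh
# Report (via exit status) whether any active swap device is a zram device.
# Usage: has_zram_swap [SWAPS_FILE]   (defaults to /proc/swaps)
has_zram_swap() {
    swaps_file=${1:-/proc/swaps}
    # Skip the header line, then look for a device name starting with /dev/zram.
    awk 'NR > 1 && $1 ~ /^\/dev\/zram/ { found = 1 } END { exit !found }' "$swaps_file"
}

# Hypothetical use on the test VM before starting the stress run:
#   has_zram_swap && stress-ng --brk 0 --stack 0
```

Taking the file as an optional argument keeps the check easy to try against a saved copy of /proc/swaps from another machine.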
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Oct 23 16:12 seq
   crw-rw---- 1 root audio 116, 33 Oct 23 16:12 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.4
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/22822/fd/10: Permission denied
   Cannot stat file /proc/22831/fd/10: Permission denied
  DistroRelease: Ubuntu 18.04
  HibernationDevice:
   RESUME=none
   CRYPTSETUP=n
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  Lsusb:
   Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
   Bus 001 Device 002: ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
  MachineType: Intel Corporation S1200SP
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:

  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 mgadrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-38-generic root=UUID=575c878a-0be6-4806-9c83-28f67aedea65 ro biosdevname=0 net.ifnames=0 panic=1 verbose console=tty0 console=ttyS0,115200n8
  ProcVersionSignature: Ubuntu 4.15.0-38.41-generic 4.15.18
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-38-generic N/A
   linux-backports-modules-4.15.0-38-generic  N/A
   linux-firmware                             1.173.1
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  Tags: bionic
  Uname: Linux 4.15.0-38-generic x86_64
  UnreportableReason: This report is about a package that is not installed.
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:

  _MarkForUpload: False
  dmi.bios.date: 01/25/2018
  dmi.bios.vendor: Intel Corporation
  dmi.bios.version: S1200SP.86B.03.01.1029.012520180838
  dmi.board.asset.tag: Base Board Asset Tag
  dmi.board.name: S1200SP
  dmi.board.vendor: Intel Corporation
  dmi.board.version: H57532-271
  dmi.chassis.asset.tag: ....................
  dmi.chassis.type: 23
  dmi.chassis.vendor: ...............................
  dmi.chassis.version: ..................
  dmi.modalias: dmi:bvnIntelCorporation:bvrS1200SP.86B.03.01.1029.012520180838:bd01/25/2018:svnIntelCorporation:pnS1200SP:pvr....................:rvnIntelCorporation:rnS1200SP:rvrH57532-271:cvn...............................:ct23:cvr..................:
  dmi.product.family: Family
  dmi.product.name: S1200SP
  dmi.product.version: ....................
  dmi.sys.vendor: Intel Corporation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1799497/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp