In our production environment of ~1800 nodes we've seen oom-kill events matching this bug's pattern: the OOM killer taking out large server processes even though their resident memory was far below the available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in the earlier comments on this ticket. However, we still kept seeing oom-kill events on kernel-upgraded systems, albeit in far lower numbers over time. These were a mystery for a while, largely due to their infrequent occurrence. After a lot of research we think we've pinned it down to a subset of our multi-socket servers with more than one NUMA memory pool.

After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95% free). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we're ending up exhausting an entire NUMA pool. Work is ongoing to establish the causality chain that leads to this. We don't yet have confirmation whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware via its arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running multi-socket systems with more than one NUMA pool who are seeing similar behavior.
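In case it is useful to anyone watching for the same pattern, below is a minimal sketch of the kind of per-node check our tracking scripts perform. It assumes only the standard sysfs layout (/sys/devices/system/node/node*/meminfo); the output format is illustrative rather than our actual production tooling. Where the numactl package is installed, "numastat -m" gives a similar per-node breakdown.

  #!/usr/bin/env python3
  """Print per-NUMA-node memory usage so a lopsided pool stands out.

  Reads /sys/devices/system/node/node*/meminfo, which the kernel
  exposes on NUMA systems. Illustrative sketch, not production code.
  """
  import glob
  import re

  def node_meminfo(path):
      """Parse one node's meminfo into {field: kB}, e.g. MemTotal/MemFree."""
      stats = {}
      with open(path) as f:
          for line in f:
              # Lines look like: "Node 0 MemTotal:  67108864 kB"
              m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)", line)
              if m:
                  stats[m.group(1)] = int(m.group(2))
      return stats

  for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
      node = path.split("/")[-2]  # e.g. "node0"
      stats = node_meminfo(path)
      total, free = stats.get("MemTotal", 0), stats.get("MemFree", 0)
      if total:
          print("%s: %d kB total, %d kB free (%.1f%% free)"
                % (node, total, free, 100.0 * free / total))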
--
https://bugs.launchpad.net/bugs/1655842

Title:
  "Out of memory" errors after upgrade to 4.4.0-59

Status in linux package in Ubuntu: Fix Released
Status in linux-aws package in Ubuntu: Confirmed
Status in linux-raspi2 package in Ubuntu: Fix Committed
Status in linux source package in Xenial: Fix Released
Status in linux-aws source package in Xenial: Confirmed
Status in linux-raspi2 source package in Xenial: Fix Committed

Bug description:

After a fix for LP#1647400, a bug that caused freezes under some workloads, some users noticed regular OOMs. Those regular OOMs were reported under this bug and fixed after some releases. Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400. If it has the fix for 1647400 but not the fix for 1655842, it is affected.

It's still possible that you will notice regressions compared to kernels that had the fixes for neither bug. However, reverting all the fixes would bring the freeze bug back, so that is not a viable path forward. If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report them under a new, verifiable bug: being able to reproduce the problem makes it possible to verify that the fixes really fix it.

Kernels affected:
  linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62
  linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:
  linux-aws

To reiterate: if you find an OOM with an affected kernel, please upgrade. If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.
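The changelog check described above is easy to script. A minimal sketch, assuming the stock Debian/Ubuntu changelog location for the running kernel image (custom or renamed kernel packages will need a different path):

  #!/usr/bin/env python3
  """Check the running kernel's changelog for the two relevant LP bugs.

  Per the note above: a kernel with the fix for LP#1647400 but without
  the fix for LP#1655842 is affected. The changelog path below is the
  stock Debian/Ubuntu location; adjust it for non-standard packages.
  """
  import gzip
  import platform

  release = platform.release()  # e.g. "4.4.0-59-generic"
  path = "/usr/share/doc/linux-image-%s/changelog.Debian.gz" % release

  with gzip.open(path, "rt", errors="replace") as f:
      text = f.read()

  has_freeze_fix = "1647400" in text  # the fix that introduced the OOM regression
  has_oom_fix = "1655842" in text     # the fix for the OOM regression itself

  if has_freeze_fix and not has_oom_fix:
      print("%s: affected by this bug - please upgrade" % release)
  else:
      print("%s: not affected by this bug" % release)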
===================

I recently replaced some Xenial servers and started experiencing "Out of memory" problems with the default kernel. We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade". Instances booted from the new AMI have been using more memory and experiencing OOM issues, sometimes during boot and sometimes a while afterwards. An example from the system log:

[  130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[  130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
  total 0
  crw-rw---- 1 root audio 116,  1 Jan 12 06:29 seq
  crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:
ProcEnviron:
  TERM=xterm-256color
  PATH=(custom, no user)
  XDG_RUNTIME_DIR=<set>
  LANG=en_US.UTF-8
  SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
  linux-restricted-modules-4.4.0-59-generic N/A
  linux-backports-modules-4.4.0-59-generic  N/A
  linux-firmware                            1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen
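As a practical footnote for anyone trying to gauge how often these events occur across machines: the sketch below counts oom-kill events in a kernel log by matching the "Out of memory: Kill process" lines quoted in the report above. The default log path is an assumption; point it at whatever file your syslog writes kernel messages to.

  #!/usr/bin/env python3
  """Count OOM-kill events in a kernel log, grouped by victim process.

  Matches lines like:
    Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
  The default path is an assumption; adjust for your syslog setup.
  """
  import re
  import sys
  from collections import Counter

  log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"
  pattern = re.compile(r"Out of memory: Kill process \d+ \((?P<name>[^)]+)\)")

  victims = Counter()
  with open(log_path, errors="replace") as f:
      for line in f:
          m = pattern.search(line)
          if m:
              victims[m.group("name")] += 1

  for name, count in victims.most_common():
      print("%6d  %s" % (count, name))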