In our production environment of ~1800 nodes we've seen oom-kill events matching this bug's pattern: the OOM killer taking out large server processes even though their resident memory was far below the available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in the earlier comments on this ticket. However, we still kept seeing oom-kill events on kernel-upgraded systems, albeit in far lower numbers over time. These were a mystery for a while, largely due to their infrequent occurrence. After a lot of research we think we've pinned it down to a subset of our multi-socket servers with more than one NUMA memory pool.

After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95% free). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we're ending up exhausting an entire NUMA pool. Work is ongoing to establish the causality chain that leads to this. We don't yet have confirmation whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware via its arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running multi-socket systems with more than one NUMA pool who are seeing similar behavior.
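In case it is useful to anyone watching for the same pattern, below is a minimal sketch of the kind of per-node check our tracking scripts perform. It assumes only the standard sysfs layout (/sys/devices/system/node/node*/meminfo); the output format is illustrative rather than our actual production tooling. Where the numactl package is installed, "numastat -m" gives a similar per-node breakdown.

  #!/usr/bin/env python3
  """Print per-NUMA-node memory usage so a lopsided pool stands out.

  Reads /sys/devices/system/node/node*/meminfo, which the kernel
  exposes on NUMA systems. Illustrative sketch, not production code.
  """
  import glob
  import re

  def node_meminfo(path):
      """Parse one node's meminfo into {field: kB}, e.g. MemTotal/MemFree."""
      stats = {}
      with open(path) as f:
          for line in f:
              # Lines look like: "Node 0 MemTotal:  67108864 kB"
              m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)", line)
              if m:
                  stats[m.group(1)] = int(m.group(2))
      return stats

  for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
      node = path.split("/")[-2]  # e.g. "node0"
      stats = node_meminfo(path)
      total, free = stats.get("MemTotal", 0), stats.get("MemFree", 0)
      if total:
          print("%s: %d kB total, %d kB free (%.1f%% free)"
                % (node, total, free, 100.0 * free / total))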
--
https://bugs.launchpad.net/bugs/1655842

Title:
  "Out of memory" errors after upgrade to 4.4.0-59

Status in linux package in Ubuntu: Fix Released
Status in linux-aws package in Ubuntu: Confirmed
Status in linux-raspi2 package in Ubuntu: Fix Committed
Status in linux source package in Xenial: Fix Released
Status in linux-aws source package in Xenial: Confirmed
Status in linux-raspi2 source package in Xenial: Fix Committed

Bug description:

After a fix for LP#1647400, a bug that caused freezes under some workloads, some users noticed regular OOMs. Those regular OOMs were reported under this bug and fixed after some releases. Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400. If it has the fix for 1647400 but not the fix for 1655842, it is affected.

It's still possible that you will notice regressions compared to kernels that had the fixes for neither bug. However, reverting all the fixes would bring the freeze bug back, so that is not a viable path forward. If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report them under a new, verifiable bug: being able to reproduce the problem makes it possible to verify that the fixes really fix it.

Kernels affected:
  linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62
  linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:
  linux-aws

To reiterate: if you find an OOM with an affected kernel, please upgrade. If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.
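The changelog check described above is easy to script. A minimal sketch, assuming the stock Debian/Ubuntu changelog location for the running kernel image (custom or renamed kernel packages will need a different path):

  #!/usr/bin/env python3
  """Check the running kernel's changelog for the two relevant LP bugs.

  Per the note above: a kernel with the fix for LP#1647400 but without
  the fix for LP#1655842 is affected. The changelog path below is the
  stock Debian/Ubuntu location; adjust it for non-standard packages.
  """
  import gzip
  import platform

  release = platform.release()  # e.g. "4.4.0-59-generic"
  path = "/usr/share/doc/linux-image-%s/changelog.Debian.gz" % release

  with gzip.open(path, "rt", errors="replace") as f:
      text = f.read()

  has_freeze_fix = "1647400" in text  # the fix that introduced the OOM regression
  has_oom_fix = "1655842" in text     # the fix for the OOM regression itself

  if has_freeze_fix and not has_oom_fix:
      print("%s: affected by this bug - please upgrade" % release)
  else:
      print("%s: not affected by this bug" % release)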
===================

I recently replaced some Xenial servers and started experiencing "Out of memory" problems with the default kernel. We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade". Instances booted from the new AMI have been using more memory and experiencing OOM issues, sometimes during boot and sometimes a while afterwards. An example from the system log:

[  130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[  130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
  total 0
  crw-rw---- 1 root audio 116,  1 Jan 12 06:29 seq
  crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:
ProcEnviron:
  TERM=xterm-256color
  PATH=(custom, no user)
  XDG_RUNTIME_DIR=<set>
  LANG=en_US.UTF-8
  SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
  linux-restricted-modules-4.4.0-59-generic N/A
  linux-backports-modules-4.4.0-59-generic  N/A
  linux-firmware                            1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen
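As a practical footnote for anyone trying to gauge how often these events occur across machines: the sketch below counts oom-kill events in a kernel log by matching the "Out of memory: Kill process" lines quoted in the report above. The default log path is an assumption; point it at whatever file your syslog writes kernel messages to.

  #!/usr/bin/env python3
  """Count OOM-kill events in a kernel log, grouped by victim process.

  Matches lines like:
    Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
  The default path is an assumption; adjust for your syslog setup.
  """
  import re
  import sys
  from collections import Counter

  log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"
  pattern = re.compile(r"Out of memory: Kill process \d+ \((?P<name>[^)]+)\)")

  victims = Counter()
  with open(log_path, errors="replace") as f:
      for line in f:
          m = pattern.search(line)
          if m:
              victims[m.group("name")] += 1

  for name, count in victims.most_common():
      print("%6d  %s" % (count, name))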