[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@racb and @jjohansen: I installed kernel 4.4.0-1050-aws, disabled sssd and apparmor on boot, and restarted a c5; it boots fine. It also boots fine with just sssd disabled on boot. If I start sssd without apparmor running, everything is fine. If I start apparmor first and then start sssd, it freezes up.

Bug description (https://bugs.launchpad.net/bugs/1746806):

After upgrading to the Ubuntu EC2 AMI from 20180126 (specifically ami-79873901 in us-west-2) we have seen sssd hard-locking c5 and m5 EC2 instances after the service is started, with CPU going to 100%. We do not experience this issue with t2 or c4 instance types, and we do not see it on any instance type using Ubuntu cloud images from 20180109 or earlier.

I have verified that this is kernel related: I booted an image we created from the Ubuntu cloud image from 20180109, which works fine on a c5. I then ran "apt update && apt install --only-upgrade linux-aws && systemctl disable sssd", rebooted the server, verified I was on the new kernel, and started sssd with "systemctl start sssd"; the EC2 instance froze and CloudWatch CPU usage for that instance went to 100%.

I haven't been able to find much in syslog, kern.log, the journalctl output, etc. The only thing I have found is that when this happens I tend to see runs of NUL bytes ("^@^@^@^@^@...") in the syslog and sssd log files. I have attached several log files and the output of "apport-bug /usr/sbin/sssd". Let me know if you need anything else to help track this down.

Thanks, Paul
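A minimal sketch of the apparmor/sssd start-order sequence from the comment above, assuming the stock systemd units; the comment reports only the outcomes, so the commands are the obvious systemctl equivalents:

  # on the affected kernel, on a c5/m5 instance:
  sudo systemctl stop apparmor sssd   # clean starting point
  sudo systemctl start sssd           # sssd alone: instance stays responsive
  sudo systemctl stop sssd
  sudo systemctl start apparmor       # load apparmor first...
  sudo systemctl start sssd           # ...then sssd: instance freezes, CPU pegs at 100%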
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@jsalisbury: I installed that kernel and rebooted on a c5.xl, and it froze. I then booted it as a c4.xl (which came up fine), disabled the apparmor service, and rebooted as a c5.xl; it booted fine. After re-enabling apparmor and rebooting as a c5.xl again, it froze on boot.
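A hedged sketch of the apparmor toggle described above; the comment reports outcomes only, so the systemctl commands are my assumption:

  # from a working boot (e.g. after switching the instance to c4):
  sudo systemctl disable apparmor
  sudo reboot              # switched back to c5.xl: boots fine on the affected kernel
  # and to reproduce the failure again:
  sudo systemctl enable apparmor
  sudo reboot              # c5.xl freezes during boot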
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@jsalisbury: I tested the kernel you provided above (commit 7de295e2a47849488acec80fc7c9973a4dca204e) and it boots fine on both a c5.xl and an m5.xl.
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@kamalmostafa I installed the rtp0 kernel and verified it boots fine using c5.xl and m5.xl instances with apparmor & sssd enabled.
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@kamalmostafa: Do you all have a target date for when the new linux-aws kernel (4.4.0-1051.60) will be released? Thanks everyone for your help in quickly tracking down this issue.
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@kamalmostafa I installed that kernel package and it worked fine with sssd running on a c5.xl.
[Kernel-packages] [Bug 1746806] Re: sssd appears to crash AWS c5 and m5 instances, cause 100% CPU
@davidjmemmett: I tested the new kernel on one c5.xl yesterday and it worked fine. I deployed the new kernel to all of our environments today, and we are seeing intermittent repros of the same behavior we saw in the past (box fails to boot, no SSH available, CPU at 100%). We reverted to the 20180109 Ubuntu AMI (kernel 4.4.0-1047.56) and it is working fine for us again.
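A couple of commands that can confirm the revert took effect; these are my suggestion, not something stated in the comment:

  uname -r                      # expect 4.4.0-1047-aws on the reverted 20180109 AMI
  sudo apt-mark hold linux-aws  # optionally keep upgrades from pulling the affected kernel back in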
[Kernel-packages] [Bug 1925261] Re: memory leak on AWS kernels when using docker
Tim / Kleber,

Thanks for your response, and I apologize: you are correct that commit 7514c0362ffdd9af953ae94334018e7356b31313 was not the fix for our issue. I had previously only tested the last handful of commits in v5.9-rc4 and didn't realize that 7514c0362ffdd9af953ae94334018e7356b31313 was a merge commit, and that the other commits that didn't include the fix had parent commits from before the fix was implemented.

I tested more kernels this week and narrowed in on commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, which appears to fix our issue:

commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
Author: Peter Xu
Date:   Fri Aug 21 19:49:57 2020 -0400

    mm/gup: Remove enfornced COW mechanism

    With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and
    work around 'COW can break either way' issue").

    There are some context differences due to some changes later on
    around it:

      2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
      376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch
    should have striped out and covered all the enforced cow bits anyways.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

To verify that this is the proper fix for the issue we are running into, I built a kernel from the parent of this fix (1a0cf26323c80e2f1c58fc04f15686de61bfab0c) and verified it exhibited the broken behavior (memory spikes within our container while running gdb, which eventually causes docker to OOM-kill the container for hitting the hard memory limit we have set). I then pulled the 1a0cf26323c80e2f1c58fc04f15686de61bfab0c code, cherry-picked a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, built and ran that kernel, and verified that I could no longer repro the issue. I also built kernels from cfc905f158eaa099d6258031614d11869e7ef71c, 4facb95b7adaf77e2da73aafb9ba60996fe42a12 and 9e2369c06c8a181478039258a4598c1ddd2cadfa and verified those exhibited the broken behavior; I then cherry-picked the a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a fix into those same commits and verified that fixed the behavior we are seeing.
Here is a list of all the commits that I tested in the past few days to narrow in on this commit as the fix:

  9322c47b21b9e05d7f9c037aa2c472e9f0dc7f3b - FIXED
  b17164e258e3888d376a7434415013175d637377 - FIXED
  1ef6ea0efe8e68d0299dad44c39dc6ad9e5d1f39 - FIXED
  c183edff33fdcd639d222a8f473bf44602adc655 - BROKEN - parent commits were based off rc1 branch, prior to a308c71 fix
  c70672d8d316ebd46ea447effadfe57ab7a30a50 - FIXED
  09274aed9021642cb3e5e0eb0e657a13ee3eafed - FIXED
  16bf121b2ddebd4421bd73098eaae1500dd40389 - FIXED
  41bef91c8aa351255cd19e7e72608ee86f7f4bab - FIXED
  f162626a038ec06da98ac38ce3d6bdbd715e9c5f - FIXED
  d824e0809ce3c9e935f3aa37381cda7fd4184f12 - FIXED
  8075fc3b113dee1531106aaec3dfa19c8158374d - FIXED
  d849ca483dba7546ad176da83bf66d1c013725f6 - FIXED
  2fb547911ca54bc9ffa2709c55c9a7638ac50ae4 - FIXED
  e2dacf6cd13c1f8d40a59fdda41ecd139c2207df - FIXED
  86edf52e7c7201fabfba39ae694a5206d48e77af - FIXED
  cf85f5de83b19361c3d575fa0ea05d8194bb0d05 - FIXED
  acf69c946233259ab4d64f8869d4037a198c7f06 - FIXED
  b25d1dc9474e1f0cefca994885e82beea271acfe - FIXED
  f7ce2c3afc938b7d743ee8e5563560c04e17d7be - BROKEN - parents 763700f + eacc9c5 were prior to fix in a308c71
  798a6b87ecd72828a6c6b5469aaa2032a57e92b7 - FIXED - parent is a308c71 so has fix
  a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a - FIXED - parent is 1a0cf26 which is broken so looks like this is the actual code fix
  1a0cf26323c80e2f1c58fc04f15686de61bfab0c - BROKEN
  09854ba94c6aad7886996bfbee2530b3d8a7f4f4 - BROKEN
  cfc905f158eaa099d6258031614d11869e7ef71c - BROKEN
  7b81ce7cdcef3a3ae71eb3fb863433c646b4a2f4 - BROKEN
  4facb95b7adaf77e2da73aafb9ba60996fe42a12 - BROKEN
  d5c678aed5eddb944b8e7ce451b107b39245962d - BROKEN
  662a0221893a3d58aa72719671844264306f6e4b - BROKEN
  2356bb4b8221d7dc8c7beb810418122ed90254c9 - BROKEN
  29aaebbca4abc4cceb38738483051abefafb6950 - BROKEN
  2822e582501b65707089b097e773e6fd70774841 - BROKEN
  7cad554887f1c5fd77e57e6bf4be38370c2160cb - BROKEN
  e52d58d54a321d4fe9d0ecdabe4f8774449f0d6e - BROKEN
  26e495f341075c09023ba16dee9a7f37a021e745 - BROKEN
  a5f785ce608cafc444cdf047d1791d5ad97943ba - BROKEN
  0ffdab6f2dea9e23ec33230de24e492ff0b186d9 - BROKEN
  30d24faba0532d6972df79a1bf060601994b5873 - BROKEN
  2d33b7d631d9dc81c78bb71368645cf7f0e68cb1 - BROKEN
  6e4e9ec65078093165463c13d4eb92b3e8d7b2e8 - BROKEN
  365d2a23663711c32e778c9c18b07163f9193925 - BROKEN
  9e2369c06c8a181478039258a4598c1ddd2cadfa - BROKEN

Please let me know if you need any further information from me on this.

Thanks, Paul
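A rough sketch of the cherry-pick verification described above, assuming a mainline git tree; the build and install steps are illustrative, since the comment only names the commit hashes:

  git checkout 1a0cf26323c80e2f1c58fc04f15686de61bfab0c     # known-BROKEN parent of the candidate fix
  git cherry-pick a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a  # apply the candidate fix on top
  make olddefconfig
  make -j"$(nproc)" bindeb-pkg                              # build installable .deb kernel packages
  # install the resulting linux-image .deb, reboot into it, and re-run the
  # gdb-inside-docker repro to check whether the container is still OOM-killed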
[Kernel-packages] [Bug 1925261] Re: memory leak on AWS kernels when using docker
Tim,

We are running Ubuntu 18.04 with the 5.3.0-1030-aws kernel because that is the last Ubuntu-provided AMI (ubuntu-bionic-18.04-amd64-server-20200716) that does not contain this kernel bug. I tried installing the latest supported Ubuntu 18.04 kernel again yesterday (5.4.0-1055-aws) and verified that the bug still exists.

Today I confirmed that installing the linux-image-5.11.0-1016-aws focal kernel from https://packages.ubuntu.com/focal-updates/linux-image-5.11.0-1016-aws on an Ubuntu 18.04 instance does fix this issue for us. My understanding from https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle is that this configuration isn't really supported/suggested, though, right? We would prefer to use the fully supported Ubuntu 18.04 AWS kernel if possible.

Thanks, Paul
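For reference, roughly how the cross-release test above could be done; the comment names only the package, so the exact steps below are an assumption:

  # download linux-image-5.11.0-1016-aws and its matching linux-modules .deb
  # from focal-updates, then on the 18.04 instance:
  sudo dpkg -i linux-modules-5.11.0-1016-aws_*.deb linux-image-5.11.0-1016-aws_*.deb
  sudo reboot
  uname -r   # expect 5.11.0-1016-aws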
[Kernel-packages] [Bug 1925261] [NEW] memory leak on AWS kernels when using docker
Public bug reported:

Ever since the "ubuntu-bionic-18.04-amd64-server-20200729" EC2 Ubuntu AMI was released, which has the "5.3.0-1032-aws" kernel, we have been hitting a 100%-reproducible memory leak that causes our app running under docker to be OOM-killed. The scenario is that we have an app running in a docker container; it occasionally catches a crash happening within itself, and when that happens it creates another process which triggers a gdb dump of the parent app. Normally this works fine, but under these specific kernels it causes memory usage to grow and grow until it hits the maximum allowed memory for the container, at which point the container is killed.

I have tested several of the latest available Ubuntu AMIs, including the latest "ubuntu-bionic-18.04-amd64-server-20210415", which has the "5.4.0-1045-aws" kernel, and the bug still exists. I also tested a number of the mainline kernels and found that the fix for this memory leak was introduced in the v5.9-rc4 kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.9-rc4/CHANGES). Do you all have any idea if or when that set of changes will be backported into a supported kernel for Ubuntu 18.04 or 20.04?

Release we are running:

  root@:~# lsb_release -rd
  Description:  Ubuntu 18.04.5 LTS
  Release:      18.04

Docker / containerd.io versions:
- containerd.io: 1.4.4-1
- docker-ce: 5:20.10.5~3-0~ubuntu-bionic

Latest supported kernel I tried which still sees the memory leak:

  root@hostname:~# apt-cache policy linux-aws
  linux-aws:
    Installed: 5.4.0.1045.27
    Candidate: 5.4.0.1045.27
    Version table:
   *** 5.4.0.1045.27 500
          500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
          500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
          100 /var/lib/dpkg/status
       4.15.0.1007.7 500
          500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Thanks, Paul

** Affects: linux-aws (Ubuntu)
   Importance: Undecided
       Status: New
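A minimal sketch of the failure scenario from the report: a memory-limited container whose workload triggers a gdb core dump of another process inside the container. The image, memory limit, and use of gcore are placeholders, not details taken from the report:

  docker run --rm -m 512m --cap-add=SYS_PTRACE ubuntu:18.04 bash -c '
    sleep 300 &                                   # stand-in for the long-running app
    apt-get update -qq && apt-get install -y -qq gdb > /dev/null
    gcore -o /tmp/core $!                         # gdb dump of the "parent" process
  '
  # on affected kernels, the container's memory usage climbs during the dump
  # until the -m limit is hit and the container is OOM-killed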
[Kernel-packages] [Bug 1925261] Re: memory leak on AWS kernels when using docker
Kleber,

I am not sure exactly which commit fixes the issue we are experiencing. I will put some time into bisecting the commits introduced in v5.9-rc4 and building/testing kernels with that code to see if I can narrow down the exact commit that introduced the fix.

Thanks, Paul
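One typical way to bisect for a fix rather than a regression, assuming a mainline clone; the exact commands are not stated in this thread, and using v5.9-rc3 as the still-broken starting point is an inference from the fix landing in v5.9-rc4:

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken v5.9-rc3
  git bisect fixed v5.9-rc4
  # at each step, build and boot a test kernel, run the gdb-under-docker repro,
  # then mark the result:
  git bisect broken     # or: git bisect fixed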
[Kernel-packages] [Bug 1925261] Re: memory leak on AWS kernels when using docker
Kleber,

I finally had some time today to narrow down which commit fixes this issue for us; it is the commit below:

commit 7514c0362ffdd9af953ae94334018e7356b31313
Merge: 9322c47b21b9 428fc0aff4e5
Author: Linus Torvalds
Date:   Sat Sep 5 13:28:40 2020 -0700

    Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "19 patches.

      Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
      checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
      hugetlb)"

    * emailed patches from Andrew Morton:
      include/linux/log2.h: add missing () around n in roundup_pow_of_two()
      mm/khugepaged.c: fix khugepaged's request size in collapse_file
      mm/hugetlb: fix a race between hugetlb sysctl handlers
      mm/hugetlb: try preferred node first when alloc gigantic page from cma
      mm/migrate: preserve soft dirty in remove_migration_pte()
      mm/migrate: remove unnecessary is_zone_device_page() check
      mm/rmap: fixup copying of soft dirty and uffd ptes
      mm/migrate: fixup setting UFFD_WP flag
      mm: madvise: fix vma user-after-free
      checkpatch: fix the usage of capture group ( ... )
      fork: adjust sysctl_max_threads definition to match prototype
      ipc: adjust proc_ipc_sem_dointvec definition to match prototype
      mm: track page table modifications in __apply_to_page_range()
      MAINTAINERS: IA64: mark Status as Odd Fixes only
      MAINTAINERS: add LLVM maintainers
      MAINTAINERS: update Cavium/Marvell entries
      mm: slub: fix conversion of freelist_corrupted()
      mm: memcg: fix memcg reclaim soft lockup
      memcg: fix use-after-free in uncharge_batch

I also verified that the latest available Ubuntu 18.04 kernel as of today (5.4.0-1054.57~18.04.1) still hits this memory leak issue for us. Please let me know if you need any further information from us to hopefully get this fix pulled into the supported Ubuntu 18.04 AWS kernel.

Thanks, Paul
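As an aside, the "Merge: 9322c47b21b9 428fc0aff4e5" line above is what marks 7514c0362ffd as a merge commit; a quick way to list a commit's parents (a generic git one-liner, not something from this thread) is:

  git show --no-patch --format='%H %P' 7514c0362ffdd9af953ae94334018e7356b31313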
[Kernel-packages] [Bug 1925261] Re: memory leak on AWS kernels when using docker
Kleber,

Sounds good, thank you!