Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2076957
[Impact]

In latency sensitive environments, it is very common to use isolcpus to reserve a set of CPUs that no other processes are to be placed on, and to run just DPDK in poll mode.

There is a bug in the Jammy kernel where, if cgroups V2 are enabled, after several minutes the kernel will place other processes onto these reserved isolcpus at random. This disturbs DPDK and introduces latency. The issue does not occur with cgroups V1, so a workaround is to use cgroups V1 instead of V2 for the moment.

[Fix]

After a full git bisect, I arrived at the following commit, which fixes the issue. It landed in 6.2-rc1:

commit 7fd4da9c1584be97ffbc40e600a19cb469fd4e78
Author: Waiman Long <long...@redhat.com>
Date: Sat Nov 12 17:19:39 2022 -0500
Subject: cgroup/cpuset: Optimize cpuset_attach() on v2
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fd4da9c1584be97ffbc40e600a19cb469fd4e78

Only the 5.15 Jammy kernel needs this fix; Focal works correctly as is. The commit skips calls to cpuset_attach() if the underlying cpusets or memory have not changed in a cgroup, and it seems to fix the issue.

[Testcase]

Deploy a bare metal server, ideally with a good number of cores; 56 should be plenty. Use Jammy with the 5.15 GA kernel.
1) Edit /etc/default/grub and set GRUB_CMDLINE_LINUX_DEFAULT to have
"isolcpus=4-7,32-35 rcu_nocb_poll rcu_nocbs=4-7,32-35 systemd.unified_cgroup_hierarchy=1"

2) sudo reboot

3) sudo cat /sys/devices/system/cpu/isolated
4-7,32-35

4) sudo apt install s-tui stress

5) sudo s-tui

6) htop

7) while true; do
     sudo ps -eLF | head -n 1
     for cpu in 4 5 6 7 32 33 34 35; do
       sudo ps -eLF | grep stress | awk -v a="$cpu" '$9 == a {print;}'
     done
     sleep 5
   done

We set up isolcpus to separate off CPUs 4-7 and 32-35, so each NUMA node has a set of isolated CPUs. s-tui is a great frontend for stress, and it starts the stress processes. All stress processes should initially be on non-isolated CPUs; confirm with htop that CPUs 4-7 and 32-35 are at 0% while every other CPU is at 100%.

After about 3 minutes, though it sometimes takes up to 10 minutes, a stress process or the s-tui process will be incorrectly placed onto an isolated CPU, causing that CPU's usage to rise in htop. The while loop checking ps for CPU placement will also likely print the incorrectly placed process.

A test kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf391137-test

If you install it, the processes will not be placed onto the isolated CPUs.

[Where problems could occur]

The patch changes how the cgroup code determines when cpuset_attach() should be called. cpuset_attach() is currently called very frequently in the 5.15 Jammy kernel, but most of those calls should be NOPs, since no changes occur to the cpusets or memory of the cgroup the process is attached to.
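As an aside on the testcase: the per-CPU awk invocations in step 7 can be folded into a small helper. This is only a sketch (the check_isolcpus name is mine, not from the report); it reads `ps -eLF` output on stdin, where field 9 is the PSR (current processor) column, and prints any thread running on one of the isolated CPUs passed as its argument.

```shell
# Hypothetical helper, not part of the original testcase: print threads from
# `ps -eLF` output (read on stdin) whose PSR column (field 9) is one of the
# isolated CPUs given as a comma- or space-separated list in $1.
check_isolcpus() {
    awk -v cpus="$1" '
        BEGIN { n = split(cpus, a, /[, ]+/); for (i = 1; i <= n; i++) iso[a[i]] = 1 }
        NR > 1 && ($9 in iso) { print }   # NR > 1 skips the ps header line
    '
}
```

Any line it prints indicates a thread scheduled on an isolated CPU, e.g. `sudo ps -eLF | check_isolcpus "4,5,6,7,32,33,34,35"` should normally print nothing on a correctly behaving kernel.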
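Since the workaround noted under [Impact] is to fall back to cgroups V1, it is also worth confirming which hierarchy the test machine actually booted with. A minimal sketch, assuming the standard /sys/fs/cgroup mount point and GNU stat; the cgroup_mode helper name is mine:

```shell
# Hypothetical helper, not from the bug report: report whether the unified
# cgroup v2 hierarchy or the legacy v1 layout is mounted at /sys/fs/cgroup.
cgroup_mode() {
    case "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" in
        cgroup2fs) echo v2 ;;        # unified v2 hierarchy
        tmpfs)     echo v1 ;;        # legacy v1: tmpfs holding per-controller mounts
        *)         echo unknown ;;   # path missing or an unexpected filesystem
    esac
}
```

With systemd.unified_cgroup_hierarchy=1 on the kernel command line, as in step 1, this should report v2.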
We are changing it to instead skip calling cpuset_attach() when there are no changes, which should offer a small performance increase as well as fix this isolcpus bug. If a regression were to occur, it would affect cgroups V2 only, and in the worst case it could cause resource limits to be applied incorrectly.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Status: Fix Released

** Affects: linux (Ubuntu Jammy)
   Importance: Medium
   Assignee: Matthew Ruffell (mruffell)
   Status: In Progress

** Tags: jammy sts

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu)
   Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
   Status: New => In Progress

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
   Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: jammy sts

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2076957

Title:
  isolcpus are ignored when using cgroups V2, causing processes to have wrong affinity

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2076957/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs