Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2076957

[Impact]

In latency-sensitive environments, it is very common to use isolcpus to
reserve a set of CPUs that no other processes are to be placed on, and
to run only DPDK in poll mode on them.

There is a bug in the Jammy kernel where, if cgroups V2 is enabled, the
kernel will after several minutes place other processes onto these
reserved isolated CPUs at random. This disturbs DPDK and introduces
latency.

The issue does not occur with cgroups V1, so a workaround is to use
cgroups V1 instead of V2 for the moment.
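
A minimal sketch of applying and checking the workaround. It assumes the
hierarchy is selected with the same systemd boot parameter that the
testcase below sets to 1; setting it to 0 to get V1, and the helper name
cgroup_mode, are illustrative assumptions rather than something stated in
this report.

```shell
# Assumed kernel command line addition (in GRUB_CMDLINE_LINUX_DEFAULT):
#   systemd.unified_cgroup_hierarchy=0

# Map the filesystem type mounted at /sys/fs/cgroup to a hierarchy label.
# cgroup_mode is a hypothetical helper name used only for this sketch.
cgroup_mode() {
    case "${1:-}" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 (legacy or hybrid)" ;;
        *)         echo "unknown" ;;
    esac
}
```

After rebooting, something like cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
should report which hierarchy is active.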

[Fix]

A full git bisect led me to the following commit, which fixes the issue.
It landed in 6.2-rc1:

commit 7fd4da9c1584be97ffbc40e600a19cb469fd4e78
Author: Waiman Long <long...@redhat.com>
Date:   Sat Nov 12 17:19:39 2022 -0500
Subject: cgroup/cpuset: Optimize cpuset_attach() on v2
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fd4da9c1584be97ffbc40e600a19cb469fd4e78

Only the 5.15 Jammy kernel needs this fix. Focal works correctly as is.

The commit skips calls to cpuset_attach() if the underlying cpusets or
memory have not changed in a cgroup, and it seems to fix the issue.

[Testcase]

Deploy a bare-metal server, ideally with a large number of cores; 56 should be plenty.
Use Jammy with the 5.15 GA kernel.

1) Edit /etc/default/grub and set GRUB_CMDLINE_LINUX_DEFAULT to include
"isolcpus=4-7,32-35 rcu_nocb_poll rcu_nocbs=4-7,32-35 systemd.unified_cgroup_hierarchy=1"
2) sudo update-grub && sudo reboot
3) sudo cat /sys/devices/system/cpu/isolated
4-7,32-35
4) sudo apt install s-tui stress
5) sudo s-tui
6) htop
7) while true; do
       sudo ps -eLF | head -n 1
       for cpu in 4 5 6 7 32 33 34 35; do
           sudo ps -eLF | grep stress | awk -v a="$cpu" '$9 == a {print;}'
       done
       sleep 5
   done

Step 1 sets up isolcpus to isolate CPUs 4-7 and 32-35, so that each NUMA
node has a set of isolated CPUs.

s-tui is a great frontend for stress, and it starts the stress processes.
All stress processes should initially be on non-isolated CPUs. Confirm
this with htop: CPUs 4-7 and 32-35 should be at 0% while every other CPU
is at 100%.

After about 3 minutes, though it sometimes takes up to 10, a stress
process or the s-tui process will be incorrectly placed onto an isolated
CPU, causing that CPU's usage to increase in htop. The while loop, which
checks the PSR column of ps (the processor each thread last ran on), will
also likely print the incorrectly placed process.
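
The check in step 7 can also be written so that the CPU list is not
hard-coded. The following is a sketch only, assuming bash and the
standard ps -eLF column layout (PSR is field 9); the helper names
expand_cpulist and threads_on_cpus are hypothetical and not part of any
tool mentioned here.

```shell
# Expand a kernel CPU list such as "4-7,32-35" into one CPU number per line.
expand_cpulist() {
    local part
    local IFS=','
    for part in ${1:-}; do
        case "$part" in
            *-*) seq "${part%-*}" "${part#*-}" ;;
            *)   echo "$part" ;;
        esac
    done
}

# Read `ps -eLF` output on stdin. Print the header line plus every thread
# whose PSR column (field 9) is in the CPU list given as $1 and whose line
# matches the optional regular expression given as $2.
threads_on_cpus() {
    local cpus
    cpus=$(expand_cpulist "${1:-}" | paste -sd' ' -)
    awk -v cpus="$cpus" -v pat="${2:-}" '
        BEGIN { n = split(cpus, list, " "); for (i = 1; i <= n; i++) want[list[i]] = 1 }
        NR == 1 || (($9 in want) && $0 ~ pat)
    '
}
```

A possible invocation on the test machine would be:
while true; do ps -eLF | threads_on_cpus "$(cat /sys/devices/system/cpu/isolated)" stress; sleep 5; done
Filtering on a name such as "stress" matters because per-CPU kernel
threads legitimately run on the isolated CPUs and would otherwise show up
in every iteration.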

A test kernel is available in the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/sf391137-test

If you install it, the processes will not be placed onto the isolated
CPUs.

[Where problems could occur]

The patch changes how the cgroup code determines when cpuset_attach()
should be called. cpuset_attach() is currently called very frequently in
the 5.15 Jammy kernel, but most of those calls should be no-ops, since
nothing has changed in the cpusets or memory of the cgroup the process is
attached to. The patch instead skips calling cpuset_attach() when there
are no changes, which should offer a small performance increase as well
as fixing this isolcpus bug.

If a regression were to occur, it would affect cgroups V2 only, and it
could cause resource limits to be applied incorrectly in the worst case.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: jammy sts

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Description changed:

- BugLink: https://bugs.launchpad.net/bugs/
+ BugLink: https://bugs.launchpad.net/bugs/2076957
  

** Tags added: jammy sts

Title:
  isolcpus are ignored when using cgroups V2, causing processes to have
  wrong affinity
