Public bug reported:

Description
===========

Nova presumes that every host CPU is available for scheduling and
executing vCPUs. This assumption is wrong because the operator might
want to allocate only a specific subset of the host's available CPUs to
execute vCPU code.

Nova uses libvirt when deployed on Linux hosts, all major Linux
distributions run systemd, and by default libvirt spawns virtual
machines in the machine.slice cgroup. An easy way for an operator to
limit the scheduling of vCPUs to a subset of the host's CPUs is
therefore to configure the cpuset controller for the machine.slice
cgroup.

Going a step further, the operator might need to allocate another
subset of the host's CPUs to a latency-sensitive application such as a
software-defined storage solution or a database. The operator sets the
cpu_exclusive bit on that custom cpuset to ensure the kernel won't
schedule any other process on the CPUs allocated to the
latency-sensitive application.

The above scenario leads to an error when Nova attempts to spawn an
instance: libvirt tries to create a child cgroup with a cpuset
containing all host CPUs, which violates the constraint imposed by the
cpu_exclusive bit on the custom cpuset, so the kernel returns a
"Permission denied" error. This applies to both cgroup v1 and v2.
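
The conflict described above can be modelled as a set intersection. The
following is a simplified sketch, not the kernel's actual validation
logic: parse_cpu_list and exclusive_conflict are hypothetical helper
names, and only the plain "0-3,8" list syntax is handled.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Parse a CPU list such as "0-3,8" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in spec.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def exclusive_conflict(requested: str, exclusive: str) -> Set[int]:
    """Return the CPUs that a requested cpuset shares with an exclusive one."""
    return parse_cpu_list(requested) & parse_cpu_list(exclusive)


if __name__ == "__main__":
    # Host with CPUs 0-3; the custom cgroup owns CPU 1 exclusively.
    print(sorted(exclusive_conflict("0-3", "1")))    # [1]
    print(sorted(exclusive_conflict("0,2-3", "1")))  # []
```

A non-empty result corresponds to the failing case in this report: the
cpuset requested for the instance includes a CPU that belongs to the
exclusive cpuset.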

A workaround is to set cpu_shared_set to the same CPU list as the
cpuset assigned to the machine.slice cgroup. However, this workaround
is cumbersome to automate across a heterogeneous fleet of hosts, where
the CPU list differs from host to host.
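
For illustration, the workaround could be scripted per host roughly as
follows. This is a minimal sketch under assumptions: the cgroup v1 path
below is the typical location and may differ on a given host,
render_cpu_shared_set is a hypothetical helper name, and the kernel's
CPU list string is assumed to be acceptable as the cpu_shared_set value
in the [compute] section of nova.conf.

```python
from pathlib import Path
from typing import Optional

# Assumed cgroup v1 location; on cgroup v2 the file is typically
# /sys/fs/cgroup/machine.slice/cpuset.cpus.
MACHINE_SLICE_CPUS = Path("/sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus")


def render_cpu_shared_set(cpuset_file: Path = MACHINE_SLICE_CPUS) -> Optional[str]:
    """Return a nova.conf fragment mirroring the given cpuset file.

    Returns None when the file is missing, unreadable, or empty, in
    which case no override should be written.
    """
    try:
        cpus = cpuset_file.read_text().strip()
    except OSError:
        return None
    if not cpus:
        return None
    return "[compute]\ncpu_shared_set = {}\n".format(cpus)


if __name__ == "__main__":
    fragment = render_cpu_shared_set()
    print(fragment if fragment else "no machine.slice cpuset found")
```

Such a script still has to run on every host and be kept in sync with
the cgroup configuration, which is the maintenance burden this report
describes.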

A better approach would be for Nova to check whether a machine.slice
cgroup exists and whether it has a defined cpuset. If it does, Nova
should use that cpuset to compute the list of CPUs on which vCPUs can
be scheduled, unless cpu_shared_set is defined. If the machine.slice
cpuset is empty or the machine.slice cgroup does not exist at all, Nova
should consider all online CPUs as schedulable.
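
The proposed precedence can be sketched as below. This is not Nova
code: the function names are hypothetical, and the file paths are the
usual cgroup v2, cgroup v1, and sysfs locations, assumed here for
illustration.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed locations of the machine.slice cpuset and the online CPU list.
CANDIDATE_FILES = (
    Path("/sys/fs/cgroup/machine.slice/cpuset.cpus"),         # cgroup v2
    Path("/sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus"),  # cgroup v1
)
ONLINE_CPUS_FILE = Path("/sys/devices/system/cpu/online")


def read_first_nonempty(paths: Iterable[Path]) -> Optional[str]:
    """Return the stripped content of the first readable, non-empty file."""
    for path in paths:
        try:
            text = path.read_text().strip()
        except OSError:
            continue
        if text:
            return text
    return None


def pick_schedulable_cpus(
    cpu_shared_set: Optional[str],
    machine_slice_cpus: Optional[str],
    online_cpus: str,
) -> str:
    """Apply the precedence: cpu_shared_set, machine.slice, online CPUs."""
    for candidate in (cpu_shared_set, machine_slice_cpus):
        if candidate and candidate.strip():
            return candidate.strip()
    return online_cpus.strip()


if __name__ == "__main__":
    print(pick_schedulable_cpus(None, "0,2-3", "0-3"))   # 0,2-3
    print(pick_schedulable_cpus("2-3", "0,2-3", "0-3"))  # 2-3
    print(pick_schedulable_cpus(None, None, "0-3"))      # 0-3
```

On a real host, machine_slice_cpus would come from
read_first_nonempty(CANDIDATE_FILES) and online_cpus from
ONLINE_CPUS_FILE. Keeping the selection a pure function of three
strings makes the precedence easy to test without touching sysfs.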

Reproduction
============

1. Deploy the Nova compute agent on a host with the default configuration.
2. Create a custom cgroup with an exclusive cpuset (cgroup v1 paths
   shown):
   # mkdir -p /sys/fs/cgroup/cpuset/test-group1
   # echo "1" > /sys/fs/cgroup/cpuset/test-group1/cpuset.cpus
   # echo "1" > /sys/fs/cgroup/cpuset/test-group1/cpuset.cpu_exclusive
3. Spawn an instance on the target host.

Expected result
===============

The instance should be spawned successfully.

Actual result
=============

The instance fails to spawn.

Environment
===========

1. OpenStack version: latest upstream

   commit 932866d078cdec51ad654aa0626a635e65975b7f (HEAD -> master, origin/master, origin/HEAD)
   Merge: 3d21445b73 26d174b65d
   Author: Zuul <z...@review.opendev.org>
   Date:   Wed Jan 22 18:30:38 2025 +0000

      Merge "Run nova-next without periodic cache healing"

2. Hypervisor: QEMU/KVM via libvirt

   Compiled against library: libvirt 8.0.0
   Using library: libvirt 8.0.0
   Using API: QEMU 8.0.0
   Running hypervisor: QEMU 6.2.0

3. No storage used, booted from an image
4. No networking used

Logs & Config
=============

An error excerpt from the nova-compute process is in the attached file
(error.txt).

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "error.txt"
   https://bugs.launchpad.net/bugs/2095591/+attachment/5853602/+files/error.txt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2095591

Title:
  CPUs in exclusive cpusets are used for scheduling vCPUs

Status in OpenStack Compute (nova):
  New
