Public bug reported:
[Impact]
libvirt's virHostCPUParseNode() (src/util/virhostcpu.c) derives the host
socket count from the maximum physical_package_id value read from
/sys/devices/system/cpu/cpu*/topology/physical_package_id.
On systems where physical_package_id is not contiguous or zero-based (for
example NVIDIA GB200), these identifiers can be very large, arbitrary numbers
(e.g. 256123234).
libvirt then allocates an array (cores_maps) sized to that maximum value and
creates one virBitmap per slot.
With a package id of ~10^9 this becomes a multi-gigabyte allocation plus ~10^9
allocation calls, causing excessive memory use, long CPU time, and OOM / denial
of service whenever libvirt inspects host CPU topology (virsh capabilities,
virsh nodeinfo, domain start, etc.).
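The size of the cores_maps allocation can be estimated with shell arithmetic.
This is a rough sketch, not libvirt code: it assumes one array slot per id
from 0 up to the maximum (max_id + 1 slots) and 8-byte pointers on a 64-bit
host. The variable names are illustrative only.

```shell
# Back-of-the-envelope estimate of the cores_maps array size; not libvirt code.
max_id=999999999                    # package id used in the test plan
slots=$((max_id + 1))               # assumption: one slot per id in 0..max_id
ptr_bytes=8                         # assumption: 64-bit host, 8-byte pointers
array_bytes=$((slots * ptr_bytes))  # pointer array only, bitmaps not included

echo "slots: $slots"
echo "array bytes: $array_bytes"
```

This prints 1000000000 slots and 8000000000 bytes (about 8 GB) for the pointer
array alone; the per-slot virBitmap allocations come on top of that.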
[Test Plan]
1. Prepare a fresh VM.
2. Install libvirt:
sudo apt-get update
sudo apt-get install -y libvirt-daemon-system libvirt-clients
3. Give CPU 0 a large physical_package_id and query host node info
(sysfs is read-only, so the value is supplied with a bind mount):
ppath=/sys/devices/system/cpu/cpu0/topology/physical_package_id
echo 999999999 | sudo tee /tmp/ppid
sudo mount --bind /tmp/ppid "$ppath"
sudo virsh nodeinfo
sudo umount "$ppath"
4. Result:
- Without the fix
virHostCPUParseNode() derives the socket count from the package id, so
cores_maps is sized for ~10^9 sockets (an 8 GB array) and the daemon then
begins allocating a bitmap per socket.
The libvirt daemon's memory balloons until the kernel OOM-killer terminates it.
"virsh nodeinfo" fails:
error: Disconnected from qemu:///system due to end of file
error: failed to get node information
error: End of file while reading data: Input/output error
...
Out of memory: Killed process <pid> (libvirtd) total-vm:~15 GB
- With the fix
the large id is counted as a single unique socket and "virsh nodeinfo" returns
the host information normally ("CPU socket(s): 1").
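To judge whether a given host is likely to trigger the problem, the distinct
package ids can be listed straight from sysfs. The helper below is a
hypothetical convenience, not part of libvirt or of the test plan; the
function name and the optional directory argument are assumptions made for
this sketch.

```shell
# list_package_ids [DIR]: print the distinct physical_package_id values found
# under DIR, sorted numerically. DIR defaults to the sysfs CPU directory.
list_package_ids() {
    cpu_dir=${1:-/sys/devices/system/cpu}
    # Only cpuN directories contain topology/physical_package_id, so other
    # cpu* entries (cpufreq, cpuidle) do not match this glob.
    cat "$cpu_dir"/cpu*/topology/physical_package_id 2>/dev/null | sort -un
}

ids=$(list_package_ids)
echo "distinct package ids:" $ids
echo "largest package id: $(printf '%s\n' $ids | tail -n 1)"
```

A largest id far above the number of distinct ids suggests the host has the
sparse numbering described under [Impact].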
[Where problems could occur]
The change is limited to virHostCPUParseNode() in src/util/virhostcpu.c.
Instead of using the maximum physical_package_id value as the socket count, it
now counts unique physical_package_id values and maps them to sequential socket
indexes using a GHashTable.
If a regression occurs, it would most likely appear as incorrect host CPU
topology reporting (such as sockets, cores, or threads in virsh nodeinfo or
capabilities output).
The change does not affect guest handling, migration, or any on-disk state.
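The idea of counting unique ids and mapping them to sequential indexes can be
sketched in shell. This is an illustration only, not the upstream patch: an
awk associative array stands in for the GHashTable, the function name is
hypothetical, and assigning indexes in order of first appearance is an
assumption of this sketch.

```shell
# map_package_ids ID...: give each distinct id the next sequential index
# (in order of first appearance) and report how many distinct ids were seen.
map_package_ids() {
    printf '%s\n' "$@" | awk '
        !($1 in idx) { idx[$1] = n++ }            # first time this id is seen
        { print "id " $1 " -> socket " idx[$1] }  # index reused for repeats
        END { print "sockets: " (n + 0) }         # count of distinct ids
    '
}

map_package_ids 256123234 256123234 999999999 256123234
```

With this approach the socket count depends on how many distinct ids exist
(2 in the call above), not on how large the ids are.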
[Other Info]
Fixed upstream in libvirt 12.1.0 by commit
a64367115015df58e0d82635a40d76df56144c60 ("util: Fix max socket calculation"):
https://github.com/libvirt/libvirt/commit/a64367115015df58e0d82635a40d76df56144c60
Affected Ubuntu releases (libvirt < 12.1.0):
focal 6.0.0-0ubuntu8
jammy 8.0.0-1ubuntu7
noble 10.0.0-2ubuntu8
questing 11.6.0-1ubuntu3
resolute 12.0.0-1ubuntu5
** Affects: libvirt (Ubuntu)
Importance: Undecided
Status: New
** Affects: libvirt (Ubuntu Focal)
Importance: Undecided
Status: New
** Affects: libvirt (Ubuntu Jammy)
Importance: Undecided
Status: New
** Affects: libvirt (Ubuntu Noble)
Importance: Undecided
Status: New
** Affects: libvirt (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: libvirt (Ubuntu Stonking)
Importance: Undecided
Status: New
** Also affects: libvirt (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: libvirt (Ubuntu Noble)
Importance: Undecided
Status: New
** Also affects: libvirt (Ubuntu Jammy)
Importance: Undecided
Status: New
** Also affects: libvirt (Ubuntu Focal)
Importance: Undecided
Status: New
** No longer affects: libvirt (Ubuntu Stonking)
--
https://bugs.launchpad.net/bugs/2153530
Title:
libvirt: excessive memory allocation / OOM when physical_package_id is
large