Reviewed:  https://review.opendev.org/c/openstack/nova/+/939317
Committed: https://opendev.org/openstack/nova/commit/f304b9eaadfd33c7ccdd6af2f60f299c3362ba1c
Submitter: "Zuul (22348)"
Branch:    master

commit f304b9eaadfd33c7ccdd6af2f60f299c3362ba1c
Author: melanie witt <melwi...@gmail.com>
Date:   Fri Oct 18 02:54:02 2024 +0000

    libvirt: Wrap un-proxied listDevices() and listAllDevices()

    This is similar to change I668643c836d46a25df46d4c99a973af5e50a39db
    where the objects returned in a list from a libvirt call were not
    tpool.Proxy wrapped. Because the objects are not wrapped, calling
    methods on them such as listCaps() can block all other greenthreads
    and can cause nova-compute to freeze for hours in certain scenarios.

    This adds the same wrapping to libvirt calls which return lists of
    virNodeDevice.

    Closes-Bug: #2091033

    Change-Id: I60d6f04d374e9ede5895a43b7a75e955b0fea3c5

** Changed in: nova
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  In Progress
Status in OpenStack Compute (nova) 2024.2 series:
  In Progress
Status in OpenStack Compute (nova) antelope series:
  In Progress
Status in OpenStack Compute (nova) bobcat series:
  In Progress

Bug description:
  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.

  Discovered during investigation of a downstream bug [1] where a live
  migration was dirtying memory faster than the transfer and
  nova-compute became frozen, unable to perform any other operations,
  not even logging, for hours.

  The freezing was tracked down to un-proxied libvirt call
  listAllDevices() which could block all other greenthreads.

  The listAllDevices() call occurs during the
  update_available_resource() periodic task in the libvirt driver in
  _get_pci_passthrough_devices().

  In a GMR collected during a repro of the issue, a traceback showing
  this was present in the report [2]:

  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

  The listAllDevices() function returned a list of unwrapped
  virNodeDevice objects and so calling listCaps() on such an unwrapped
  device could cause a freeze.

  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.

  During investigation it was also noticed that the listDevices() call
  list items were not tpool.Proxy wrapped, so this is fixed as well in
  the patch.
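
For readers unfamiliar with the failure mode, the following is a minimal, standard-library-only sketch of why a proxied connection can still hand back un-proxied list items. It is not the nova patch and it does not use eventlet: ThreadProxy is a hypothetical stand-in that only imitates the general idea of tpool.Proxy (run method calls in a worker thread, and re-wrap return values that are instances of the autowrap classes). FakeConnection, FakeNodeDevice and list_all_devices are likewise invented for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Worker threads that stand in for eventlet's native thread pool.
_POOL = ThreadPoolExecutor(max_workers=4)


class ThreadProxy:
    """Hypothetical stand-in for tpool.Proxy (an assumption, not eventlet)."""

    def __init__(self, obj, autowrap=()):
        self._obj = obj
        self._autowrap = tuple(autowrap)

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            # Run the real method in a worker thread, not in the caller.
            result = _POOL.submit(attr, *args, **kwargs).result()
            # Only direct instances of the autowrap classes are re-wrapped.
            # A list is not one of them, so its items stay raw.
            if isinstance(result, self._autowrap):
                return ThreadProxy(result, self._autowrap)
            return result

        return call


class FakeNodeDevice:
    """Invented stand-in for virNodeDevice; records where listCaps() ran."""

    def __init__(self, name, caps):
        self.name = name
        self.caps = caps
        self.last_thread = None

    def listCaps(self):
        self.last_thread = threading.current_thread()
        return list(self.caps)


class FakeConnection:
    """Invented stand-in for a libvirt connection."""

    def listAllDevices(self, flags=0):
        return [FakeNodeDevice("pci_0000_00_01_0", ["pci"]),
                FakeNodeDevice("net_eth0", ["net"])]


def list_all_devices(conn, flags=0):
    """Wrap every list item explicitly, which autowrap does not do."""
    return [ThreadProxy(dev) for dev in conn.listAllDevices(flags)]


caller = threading.current_thread()
conn = ThreadProxy(FakeConnection(), autowrap=(FakeNodeDevice,))

raw = conn.listAllDevices()       # plain list of raw devices
wrapped = list_all_devices(conn)  # list of proxied devices

raw[0].listCaps()
wrapped[0].listCaps()

# The raw device ran listCaps() in the calling thread (the blocking case);
# the wrapped device ran it in a pool worker.
raw_ran_in_caller = raw[0].last_thread is caller
wrapped_ran_in_worker = wrapped[0].last_thread is not caller

pci_names = [d.name for d in wrapped if "pci" in d.listCaps()]
```

In this sketch the connection itself is proxied, yet the items of the returned list are not, which mirrors the situation described above: the call on the list item executes directly in the caller, where a slow native call would stall every other greenthread.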

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21