Reviewed:  https://review.opendev.org/c/openstack/nova/+/939317
Committed: https://opendev.org/openstack/nova/commit/f304b9eaadfd33c7ccdd6af2f60f299c3362ba1c
Submitter: "Zuul (22348)"
Branch:    master

commit f304b9eaadfd33c7ccdd6af2f60f299c3362ba1c
Author: melanie witt <melwi...@gmail.com>
Date:   Fri Oct 18 02:54:02 2024 +0000

    libvirt: Wrap un-proxied listDevices() and listAllDevices()
    
    This is similar to change I668643c836d46a25df46d4c99a973af5e50a39db
    where the objects returned in a list from a libvirt call were not
    tpool.Proxy wrapped. Because the objects are not wrapped, calling
    methods on them such as listCaps() can block all other greenthreads
    and can cause nova-compute to freeze for hours in certain scenarios.
    
    This adds the same wrapping to libvirt calls which return lists of
    virNodeDevice.
    
    Closes-Bug: #2091033
    
    Change-Id: I60d6f04d374e9ede5895a43b7a75e955b0fea3c5


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091033

Title:
  Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
  freeze for hours

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  In Progress
Status in OpenStack Compute (nova) 2024.2 series:
  In Progress
Status in OpenStack Compute (nova) antelope series:
  In Progress
Status in OpenStack Compute (nova) bobcat series:
  In Progress

Bug description:
  tl;dr This bug has the same root cause as
  https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
  returned from libvirt are not automatically wrapped in a tpool.Proxy.

  This was discovered while investigating a downstream bug [1] in which
  a live migration was dirtying memory faster than it could be
  transferred, and nova-compute froze for hours, unable to perform any
  other operation, not even logging.

  The freeze was tracked down to the un-proxied libvirt call
  listAllDevices(), whose returned objects could block all other
  greenthreads. The listAllDevices() call occurs during the
  update_available_resource() periodic task, in the libvirt driver's
  _get_pci_passthrough_devices(). A GMR collected during a repro of the
  issue contained a traceback showing this [2]:

  stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
  stderr F     `task(self, context)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
  stderr F     `startup=startup)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
  stderr F     `resources = self.driver.get_available_resource(nodename)`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
  stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
  stderr F     `in devices.items() if "pci" in dev.listCaps()]`
  stderr F
  stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
  stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

  The listAllDevices() function returned a list of unwrapped
  virNodeDevice objects and so calling listCaps() on such an unwrapped
  device could cause a freeze.

  Based on the above, the bug reporter was able to test a patch [3] to
  wrap listAllDevices() list items in tpool.Proxy and the result showed
  nova-compute no longer freezing [4] in the aforementioned scenario.

  During the investigation it was also noticed that the items in the
  list returned by listDevices() were not tpool.Proxy wrapped, so the
  patch fixes this as well.
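
  The wrapping idea described above can be sketched in isolation. This is
  a minimal illustration, not the nova patch itself: FakeNodeDevice,
  RecordingProxy and wrap_list_items are hypothetical names invented
  here, and RecordingProxy only records calls rather than dispatching
  them to a native thread. In a real service the proxy factory would be
  something like eventlet's tpool.Proxy, and the items would be
  virNodeDevice objects returned by libvirt.

```python
class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt virNodeDevice."""

    def __init__(self, name, caps):
        self._name = name
        self._caps = caps

    def name(self):
        return self._name

    def listCaps(self):
        # With real libvirt this is a blocking call into native code.
        return list(self._caps)


class RecordingProxy:
    """Hypothetical stand-in for a thread-pool proxy.

    It only records which methods were called through it; it does not
    actually offload anything to another thread.
    """

    calls = []

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, attr_name):
        # Only reached for attributes not found on the proxy itself.
        target = getattr(self._obj, attr_name)

        def proxied(*args, **kwargs):
            RecordingProxy.calls.append(attr_name)
            return target(*args, **kwargs)

        return proxied


def wrap_list_items(items, proxy_factory):
    """Wrap every object in a returned list, not just the list itself."""
    return [proxy_factory(item) for item in items]


devices = [
    FakeNodeDevice("pci_0000_00_1f_2", ["pci"]),
    FakeNodeDevice("net_eth0", ["net"]),
]

# Every item is wrapped before any method is called on it.
wrapped = wrap_list_items(devices, RecordingProxy)
pci_names = [dev.name() for dev in wrapped if "pci" in dev.listCaps()]
```

  The point of the sketch is that the list elements must be wrapped one
  by one: proxying the call that returns the list does not by itself
  cover method calls made later on the objects inside it.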

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
  [3] https://review.opendev.org/c/openstack/nova/+/932669
  [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2091033/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
