Public bug reported:

Running a small cluster with 16 compute nodes and 3 controller nodes on OpenStack Queens using SR-IOV VFs. From time to time, it appears that the Nova scheduler loses track of some of the PCI devices (VFs) that are actively mapped into servers. We don't know exactly when this occurs and we cannot trigger it on demand, but it does occur on a number of the compute nodes over time. Restarting the affected compute node resolves the issue.
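For what it's worth, a rough way to see which VFs libvirt believes are attached on a suspect node (to compare against what the scheduler still hands out) is to walk the running domains' XML. The snippet below is only an illustrative sketch, not anything from Nova itself; it assumes the libvirt Python bindings are installed on the compute node, and it scans both <hostdev> devices and <interface type='hostdev'> NICs since SR-IOV VFs can be attached either way.

# Illustrative only: list the PCI VFs that libvirt currently has attached to
# running domains on this compute node.  Assumes the libvirt Python bindings
# are installed; SR-IOV VFs may appear either as <hostdev> devices or as
# <interface type='hostdev'> NICs, so both are scanned.
import xml.etree.ElementTree as ET

import libvirt

conn = libvirt.open('qemu:///system')
in_use = {}
for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    root = ET.fromstring(dom.XMLDesc(0))
    addrs = root.findall("./devices/hostdev[@type='pci']/source/address")
    addrs += root.findall("./devices/interface[@type='hostdev']/source/address")
    for a in addrs:
        pci = '%04x:%02x:%02x.%x' % (
            int(a.get('domain'), 16), int(a.get('bus'), 16),
            int(a.get('slot'), 16), int(a.get('function'), 16))
        in_use[pci] = dom.name()   # e.g. '0000:04:01.3' -> 'instance-00001466'

for pci, name in sorted(in_use.items()):
    print('%s in use by %s' % (pci, name))
conn.close()

In the failure below, such a listing would be expected to show 0000:04:01.3 attached to instance-00001466 on node05, even though the scheduler was still offering that VF to new builds.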
The problem manifests with the following errors:

/var/log/nova/nova-conductor.log:2019-05-03 01:35:27.309 13073 ERROR nova.scheduler.utils [req-8418eb3a-4118-4505-97e3-fffbaae7aae6 2469493ff8b546ff9a6f4e339cc50ac2 33bb32d9463340bca0bb72a8c36579a9 - default default] [instance: b2b4dbf2-d381-4416-95c9-b410aa6d8377] Error from last host: node05 (node {REDACTED}):
[u'Traceback (most recent call last):\n',
 u'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1828, in _do_build_and_run_instance\n    filter_properties, request_spec)\n',
 u'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2108, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n',
 u'RescheduledException: Build of instance b2b4dbf2-d381-4416-95c9-b410aa6d8377 was re-scheduled: Requested operation is not valid: PCI device 0000:04:01.3 is in use by driver QEMU, domain instance-00001466\n']

The compute nodes in question are configured with the following PCI whitelist:

[pci]
passthrough_whitelist = [{"vendor_id": "15b3", "product_id": "1004"}]

Note that, despite similar bugs, there have not been any changes to the whitelist that would be likely to cause this. It just seems to develop over time. (A rough sketch of enumerating the VFs this whitelist matches is appended at the end of this report.)

===== Versions =====

Compute nodes:

ii  nova-common           2:17.0.6-0ubuntu1  all  OpenStack Compute - common files
ii  nova-compute          2:17.0.6-0ubuntu1  all  OpenStack Compute - compute node base
ii  nova-compute-kvm      2:17.0.6-0ubuntu1  all  OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt  2:17.0.6-0ubuntu1  all  OpenStack Compute - compute node libvirt support

Controller nodes:

ii  nova-api               2:17.0.9-0ubuntu1  all  OpenStack Compute - API frontend
ii  nova-common             2:17.0.9-0ubuntu1  all  OpenStack Compute - common files
ii  nova-compute            2:17.0.9-0ubuntu1  all  OpenStack Compute - compute node base
ii  nova-compute-kvm        2:17.0.9-0ubuntu1  all  OpenStack Compute - compute node (KVM)
ii  nova-compute-libvirt    2:17.0.9-0ubuntu1  all  OpenStack Compute - compute node libvirt support
ii  nova-conductor          2:17.0.9-0ubuntu1  all  OpenStack Compute - conductor service
ii  nova-consoleauth        2:17.0.9-0ubuntu1  all  OpenStack Compute - Console Authenticator
ii  nova-novncproxy         2:17.0.9-0ubuntu1  all  OpenStack Compute - NoVNC proxy
ii  nova-placement-api      2:17.0.9-0ubuntu1  all  OpenStack Compute - placement API frontend
ii  nova-scheduler          2:17.0.9-0ubuntu1  all  OpenStack Compute - virtual machine scheduler
ii  nova-serialproxy        2:17.0.9-0ubuntu1  all  OpenStack Compute - serial proxy
ii  nova-xvpvncproxy        2:17.0.9-0ubuntu1  all  OpenStack Compute - XVP VNC proxy

** Affects: nova
   Importance: Undecided
       Status: New

--
https://bugs.launchpad.net/bugs/1827453

Title:
  Nova scheduler attempts to re-assign currently in-use SR-IOV VF to
  new VM

Status in OpenStack Compute (nova):
  New
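As referenced next to the whitelist above, here is a rough, purely illustrative sketch of enumerating the host VFs that the [pci] passthrough_whitelist entry matches, by reading vendor/product IDs out of sysfs. Only the 15b3:1004 pair comes from the whitelist; everything else in the snippet is our own assumption and does not reproduce Nova's internal whitelist matching code.

# Illustrative only: enumerate host PCI devices whose vendor/product IDs match
# the whitelisted Mellanox VF (15b3:1004), by reading sysfs.  This mirrors the
# vendor_id/product_id match in our passthrough_whitelist; Nova's own whitelist
# parsing is not reproduced here.
import glob
import os

WANT = ('0x15b3', '0x1004')   # vendor_id, product_id from the whitelist above

for dev in sorted(glob.glob('/sys/bus/pci/devices/*')):
    with open(os.path.join(dev, 'vendor')) as f:
        vendor = f.read().strip()
    with open(os.path.join(dev, 'device')) as f:
        product = f.read().strip()
    if (vendor, product) == WANT:
        print(os.path.basename(dev))   # PCI address, e.g. 0000:04:01.3

Comparing that set against the libvirt listing earlier in this report and against what the scheduler believes is free is roughly how we would hope to narrow down where the accounting drifts.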