stable/queens is EOL

** Changed in: nova/queens
       Status: In Progress => Won't Fix
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1899541

Title:
  archive_deleted_rows archives pci_devices records as residue because
  of 'instance_uuid'

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Won't Fix
Status in OpenStack Compute (nova) rocky series:
  Won't Fix
Status in OpenStack Compute (nova) stein series:
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Fix Released

Bug description:
  This is based on a bug reported downstream [1] where, after a random
  amount of time, update_available_resource began to fail with the
  following trace on nodes with PCI devices:

    "traceback": [
      "Traceback (most recent call last):",
      " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
      " rt.update_available_resource(context, nodename)",
      " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
      " self._update_available_resource(context, resources)",
      " File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
      " return f(*args, **kwargs)",
      " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
      " self._update(context, cn)",
      " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
      " self.pci_tracker.save(context)",
      " File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
      " dev.save()",
      " File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
      " ctxt, self, fn.__name__, args, kwargs)",
      " File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
      " objmethod=objmethod, args=args, kwargs=kwargs)",
      " File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
      " retry=self.retry)",
      " File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
      " timeout=timeout, retry=retry)",
      " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
      " retry=retry)",
      " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
      " raise result",
      "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\")
       [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address,
       vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node,
       parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s,
       %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s,
       %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)']
       [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None,
       'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044),
       'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None,
       'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}]
       (Background on this error at: http://sqlalche.me/e/gkpj)",

  Here ^ we see an attempt to insert a nearly empty (NULL fields) record
  into the pci_devices table. Inspection of the code shows that the way
  this can occur is if we fail to look up the pci_devices record we want
  and then try to create a new one [2]:

    @pick_context_manager_writer
    def pci_device_update(context, node_id, address, values):
        query = model_query(context, models.PciDevice, read_deleted="no").\
                        filter_by(compute_node_id=node_id).\
                        filter_by(address=address)
        if query.update(values) == 0:
            device = models.PciDevice()
            device.update(values)
            context.session.add(device)
        return query.one()

  It turns out that when a request came in to delete an instance that
  had a PCI device allocated, if the archive_deleted_rows cron job fired
  at just the right (wrong) moment, it would sweep away the pci_devices
  record matching the instance_uuid, because archive_deleted_rows treats
  any table with an 'instance_uuid' column as instance "residue" needing
  cleanup. After the pci_devices record was swept away, the attempt to
  update the resource tracker as part of the _complete_deletion method
  in the compute manager failed because we could not locate the
  pci_devices record in order to free the PCI device (null out its
  instance_uuid field).

  What we need to do here is stop treating pci_devices records as
  instance residue. The records in pci_devices are not tied to instance
  lifecycles at all; they are managed independently by the PCI trackers.
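  To make the failure mode concrete, below is a minimal, standalone
  reproduction sketch. It is not nova code: it assumes SQLAlchemy 1.4+,
  uses an in-memory SQLite database, and a stripped-down stand-in for
  the pci_devices table, but the update-or-insert shape mirrors
  pci_device_update above. Once the row has been archived away, the
  UPDATE matches nothing and the fallback INSERT is built only from the
  "free the device" values, so NOT NULL columns such as compute_node_id
  end up empty and the flush fails with the same IntegrityError seen in
  the traceback:

    # Standalone reproduction sketch -- NOT nova code.  Assumes SQLAlchemy
    # 1.4+ and sqlite3; the table is a stripped-down stand-in for
    # pci_devices and the helper only mirrors the *shape* of
    # pci_device_update above.
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.exc import IntegrityError
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class PciDevice(Base):
        __tablename__ = 'pci_devices'
        id = Column(Integer, primary_key=True)
        compute_node_id = Column(Integer, nullable=False)  # NOT NULL, as in nova
        address = Column(String(12), nullable=False)
        status = Column(String(36))
        instance_uuid = Column(String(36), nullable=True)

    def pci_device_update(session, node_id, address, values):
        # Try an UPDATE first; if no row matched, fall back to inserting
        # a brand new record built only from ``values``.
        query = (session.query(PciDevice)
                 .filter_by(compute_node_id=node_id)
                 .filter_by(address=address))
        if query.update(values) == 0:
            device = PciDevice()
            for key, value in values.items():
                setattr(device, key, value)
            session.add(device)
        return query.one()

    engine = create_engine('sqlite:///:memory:')
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(PciDevice(compute_node_id=1, address='0000:81:00.0',
                              status='allocated', instance_uuid='fake-uuid'))
        session.commit()

        # Simulate archive_deleted_rows sweeping the record away while the
        # instance delete is still in flight.
        session.query(PciDevice).delete()
        session.commit()

        # The compute side now tries to free the device.  Only the fields
        # being reset are in ``values``, so the fallback INSERT carries
        # NULL for compute_node_id (and address) and the flush blows up.
        try:
            pci_device_update(session, 1, '0000:81:00.0',
                              {'status': 'available', 'instance_uuid': None})
            session.commit()
        except IntegrityError as exc:
            print(exc)  # NOT NULL constraint failed: pci_devices.compute_node_id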
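  And a sketch of the direction of the fix (illustrative only, not the
  actual nova patch; the helper and constant names below are made up):
  when the archive job builds the set of tables holding per-instance
  residue, tables whose instance_uuid column does not imply instance
  ownership -- pci_devices -- must be skipped, since the PCI tracker
  frees and re-parents those rows itself:

    # Illustrative sketch only -- not the actual nova change.  The idea is
    # to stop treating *every* table with an 'instance_uuid' column as
    # rows that die with the instance.
    from sqlalchemy import MetaData

    # Tables that carry an 'instance_uuid' column but whose rows outlive
    # the instance; the PCI tracker nulls out instance_uuid when a device
    # is freed.
    EXCLUDED_FROM_INSTANCE_RESIDUE = {'pci_devices'}

    def instance_residue_tables(engine):
        """Return the tables whose rows should be archived along with a
        soft-deleted instance."""
        metadata = MetaData()
        metadata.reflect(bind=engine)
        return [table for table in metadata.sorted_tables
                if 'instance_uuid' in table.columns
                and table.name not in EXCLUDED_FROM_INSTANCE_RESIDUE]

  The archive job would then only sweep rows matching the archived
  instances' UUIDs out of the tables this helper returns, leaving
  pci_devices records alone for the resource tracker to reconcile.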
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
  [2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1899541/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp