** Also affects: nova/stein
   Importance: Undecided
       Status: New

** Changed in: nova/stein
   Importance: Undecided => Medium

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834712

Title:
  ResourceTracker._update should restore previous old_resources value if
  ComputeNode.save fails

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) stein series:
  In Progress

Bug description:
  This is a follow up to bug 1834694 with the debug information here:

  https://review.opendev.org/#/c/668252/1/nova/scheduler/host_manager.py@626

  This is on an overloaded system where conductor and mysql are having
  problems and database connections are getting dropped.

  On the first start of the compute service, the compute node record is
  created without the free_disk_gb field set. Later in the _update()
  method in ResourceTracker the _resource_change method returns True and
  updates the self.old_resources value:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L908

  Then the ComputeNode.save() fails with a DB error here:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1010

  That kills the update_available_resource run but doesn't kill the
  service because:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L8130

  Later when update_available_resource runs, _resource_change does not
  detect any changes here because old_resources was updated before:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L906

  So we don't try to call ComputeNode.save() again but instead call
  _update_to_placement here:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1012

  This can create the resource provider with inventory in the placement
  service.
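
  The sequence above, and the restore suggested in the bug title, can be
  sketched roughly as follows. This is a simplified illustration and not
  the nova code: FakeComputeNode, SketchTracker, the restore_on_failure
  flag and run() are hypothetical names, old_resources is reduced to a
  plain dict, and the DB error is simulated with a RuntimeError.

```python
import copy


class FakeComputeNode:
    """Hypothetical stand-in for a compute node record (not nova's class)."""

    def __init__(self):
        self.free_disk_gb = None
        self.fail_next_save = False
        self.saved = {}  # what the simulated database last accepted

    def save(self):
        if self.fail_next_save:
            self.fail_next_save = False
            raise RuntimeError("simulated dropped DB connection")
        self.saved = {"free_disk_gb": self.free_disk_gb}


class SketchTracker:
    """Hypothetical, heavily reduced model of the change tracking."""

    def __init__(self, restore_on_failure):
        self.restore_on_failure = restore_on_failure
        self.old_resources = {}

    def _resource_change(self, node):
        current = {"free_disk_gb": node.free_disk_gb}
        if current != self.old_resources:
            # The cached copy is updated *before* the save is attempted.
            self.old_resources = copy.deepcopy(current)
            return True
        return False

    def update(self, node):
        previous = copy.deepcopy(self.old_resources)
        if self._resource_change(node):
            try:
                node.save()
            except Exception:
                if self.restore_on_failure:
                    # Put the cache back so the next run sees a change again.
                    self.old_resources = previous
                raise
            return True
        return False  # no change detected, save is skipped


def run(restore_on_failure):
    node = FakeComputeNode()
    node.free_disk_gb = 100
    node.fail_next_save = True
    tracker = SketchTracker(restore_on_failure)
    try:
        tracker.update(node)  # first periodic run: save fails
    except RuntimeError:
        pass  # the caller swallows the error, as described above
    tracker.update(node)  # next periodic run
    return node.saved
```

  In this sketch, run(False) leaves the simulated record unsaved because
  the second run sees no change, while run(True) retries the save on the
  second run.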

  Once the resource provider exists in placement, the scheduler can get
  it back from placement even though the compute node record itself was
  never updated, which results in hitting this code in the scheduler:

  https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/scheduler/host_manager.py#L193

  That leaves some of the HostState fields unset, which in turn results
  in issues like bug 1834691 and bug 1834694.

  We could deal with the RT issues in a few ways, like not allowing the
  compute service to start if we can't create and update the compute
  node (rather than just catching and swallowing Exception in the
  ComputeManager), but that might have other side effects. An easy thing
  to do here is to make sure we roll back the changes to old_resources
  in the RT if compute_node.save() fails.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834712/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp