Reviewed:  https://review.opendev.org/c/openstack/nova/+/933734
Committed: https://opendev.org/openstack/nova/commit/2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Submitter: "Zuul (22348)"
Branch:    master

commit 2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Author: Balazs Gibizer <g...@redhat.com>
Date:   Wed Oct 30 13:24:41 2024 +0100

    Route shared storage RPC to evac dest at startup

    If a compute is started up while an evacuation of an instance from
    this host is still in progress, then the destroy_evacuated_instances
    call will try to check if the instance is on shared storage to decide
    whether the local disk needs to be deleted from the source node.
    However, this call uses instance.host to target the RPC call. If the
    evacuation is still ongoing, then instance.host might still be set to
    the source node. This means that the source node tries to call RPC on
    itself during init_host. This will always time out, as the RPC server
    is only started after init_host. It is also wrong, as the shared
    storage check RPC should be called on another host. Moreover, when
    this wrongly routed RPC times out, the source compute logs the
    exception, ignores it, and then assumes the disk is on shared storage,
    so it won't clean it up. This means that a later evacuation of this VM
    targeting this node will fail, as the instance directory is already
    present on the node.

    The fix is simple: the destroy_evacuated_instances call should always
    send the shared storage check RPC call to the destination node of the
    evacuation, based on the migration record. This is correct whether the
    evacuation is still in progress or has already finished.

    Closes-Bug: #2085975
    Change-Id: If5ad213649d68da995dad146f0a0c3cacc369309

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085975

Title:
  Compute fails to clean up after evacuated instance if the evacuation
  still in progress

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Reproduce:
  * have a two node devstack, hostA and hostB, both with simple local
    storage
  * start an instance on hostA
  * inject a sleep in nova.virt.driver.rebuild to simulate that rebuild
    takes time
  * stop hostA
  * evacuate the VM
  * while the evacuation is still in progress on hostB, start up hostA

  Actual:
  hostA tries to check if the VM is using shared storage and sends an RPC
  call to instance.host. As that is not yet set to the destination, the
  RPC call hits hostA, which is still in init_host, so the RPC is never
  answered and hostA's destroy_evacuated_instances call gets a
  MessagingTimeout exception. That is logged and then ignored. But nova
  defaults the shared_storage flag to true, so in this case the local
  instance dir is not cleaned up.

  Expected:
  hostA sends the RPC call to hostB, which responds, and the local
  instance dir on hostA is cleaned up.
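
For illustration only, the routing problem and the fix described above can be sketched as a small stand-alone simulation. This is not nova code: the Instance, Migration, and FakeRpc classes and the should_delete_local_disk helper are hypothetical stand-ins, and the RPC method is reduced to a simplified signature. The sketch only assumes what the report states, namely that a host still in init_host does not answer RPC and that the shared storage flag defaults to true when the check times out.

```python
from dataclasses import dataclass


class MessagingTimeout(Exception):
    """Stand-in for the RPC timeout described in the bug report."""


@dataclass
class Instance:
    uuid: str
    host: str  # still the source host while the evacuation is in progress


@dataclass
class Migration:
    # Hypothetical, simplified migration record for the evacuation.
    instance_uuid: str
    source_compute: str
    dest_compute: str


class FakeRpc:
    """Toy RPC layer: a host only answers once its RPC server is running."""

    def __init__(self, serving_hosts):
        self.serving_hosts = set(serving_hosts)

    def check_instance_shared_storage(self, target_host):
        if target_host not in self.serving_hosts:
            raise MessagingTimeout(target_host)
        # Both hosts use simple local storage in the reproduction steps.
        return False


def should_delete_local_disk(rpc, target_host):
    """Return True if the source host should remove the local instance dir."""
    shared = True  # default kept when the check cannot complete
    try:
        shared = rpc.check_instance_shared_storage(target_host)
    except MessagingTimeout:
        pass  # the report says the exception is logged and then ignored
    return not shared


instance = Instance(uuid="vm-1", host="hostA")
migration = Migration("vm-1", source_compute="hostA", dest_compute="hostB")
rpc = FakeRpc(serving_hosts={"hostB"})  # hostA is still in init_host

# Old routing: target instance.host, which is still the source host.
before_fix = should_delete_local_disk(rpc, instance.host)
# Fixed routing: target the evacuation destination from the migration record.
after_fix = should_delete_local_disk(rpc, migration.dest_compute)
print(before_fix, after_fix)
```

With the old routing the call times out against hostA itself, the default is kept, and the disk is left behind. Routing to the destination taken from the migration record gets a real answer regardless of whether instance.host has been updated yet.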