Public bug reported:
Race condition with audit on migrate revert with PCI devices
https://bugs.launchpad.net/bugs/1644042
Status in OpenStack Compute (nova): New

Description
===========
This bug was found while stress testing cold migration revert for an
instance with an SR-IOV PCI device: migrate and then revert migrate are
executed in a loop for the same guest.

The exact problem is a race condition with the resource tracker periodic
audit. When the audit is triggered after revert_resize() has executed on
the destination, but before finish_revert_resize() has executed on the
source node, the resource tracker on the source gets the list of all
instances and migrations running on that node. It then calls the PCI
manager's clean_usage() to free any PCI resources allocated to an
instance that was not found to be running (or for which no migration was
found to be in progress). So what happens in this case is that the PCI
devices for the instance on the source node are freed while the revert is
still executing. Following this condition, the instance actually reverts
successfully, but the PCI devices are removed from instance.pci_devices
because of clean_usage() [1]. On the following migration, the instance
then fails in finish_resize() because the neutron port cannot be
correlated: migration_context.old_pci_devices is empty [2]. A simplified
sketch of the race follows the reproduction steps below.

Steps to reproduce
==================
1) Configure a multi-node (two nodes) setup with SR-IOV enabled on both
   computes
2) Boot a guest with an SR-IOV device
3) In a loop (see the scripted example below):
   a) Cold migrate the guest
   b) Revert resize the guest
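The migrate/revert loop in step 3 can be scripted. Below is a
hypothetical driver using the python-novaclient CLI; the server name
"sriov-guest", the polling helper, and the timeout are illustrative
assumptions, not part of the original report:

    #!/usr/bin/env python
    # Hypothetical reproduction loop: cold migrate, wait for
    # VERIFY_RESIZE, revert the resize, wait for ACTIVE, repeat.
    import subprocess
    import time

    SERVER = "sriov-guest"  # placeholder server name

    def wait_for_status(server, status, timeout=600):
        # Poll 'nova show' until the server reaches the given status.
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(["nova", "show", server])
            for line in out.decode().splitlines():
                if line.startswith("| status") and status in line:
                    return
            time.sleep(5)
        raise RuntimeError("timed out waiting for status %s" % status)

    while True:
        subprocess.check_call(["nova", "migrate", SERVER])        # 3a
        wait_for_status(SERVER, "VERIFY_RESIZE")
        subprocess.check_call(["nova", "resize-revert", SERVER])  # 3b
        wait_for_status(SERVER, "ACTIVE")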
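To make the race described above concrete, here is a minimal sketch of
the audit's cleanup decision. These are illustrative stand-ins, not
nova's actual resource tracker or PciDevTracker code; the device address
and instance uuid are taken from the logs below:

    # Minimal sketch of the clean_usage() race; illustrative stand-ins
    # only, not nova's actual implementation.
    from collections import namedtuple

    Instance = namedtuple("Instance", "uuid")
    Migration = namedtuple("Migration", "instance_uuid")

    class PciDevTrackerSketch(object):
        def __init__(self, allocations):
            # device address -> owning instance uuid (None means free)
            self.allocations = allocations

        def clean_usage(self, instances, migrations):
            # Free every device whose owner is neither an instance
            # running on this host nor part of a migration this host
            # currently considers in progress.
            known = set(i.uuid for i in instances)
            known |= set(m.instance_uuid for m in migrations)
            for address, owner in self.allocations.items():
                if owner is not None and owner not in known:
                    self.allocations[address] = None  # back to the pool

    # Window of the race: revert_resize() has run on the destination but
    # finish_revert_resize() has not yet run on the source, so the
    # source audit sees neither a running instance nor an in-progress
    # migration for the guest, and frees its VF:
    tracker = PciDevTrackerSketch(
        {"0000:85:11.7": "e99dc7ad-3c1c-4cd0-96d1-00e6cbcb8725"})
    tracker.clean_usage(instances=[], migrations=[])
    assert tracker.allocations["0000:85:11.7"] is None  # wrongly freed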
Expected result
===============
It should be possible to successfully revert migrate a guest over and
over (multiple times in a row).

Actual result
=============
After some number of migrate/revert iterations, the next migration fails
in finish_resize() with "PortUpdateFailed: ... Unable to correlate PCI
slot ..." and the instance is set to the ERROR state (see the logs
below).

Environment
===========
commit 633c817de5a67e798d8610d0df1135e5a568fd8a
Author: Matt Riedemann <mrie...@us.ibm.com>
Date:   Sat Nov 12 11:59:13 2016 -0500

    api-ref: fix server_id in metadata docs

    The api-ref was saying that the server_id was in the body of the
    server metadata requests but it's actually in the path for all of
    the requests.

    Change-Id: Icdecd980767f89ee5fcc5bdd4802b2c263268a26
    Closes-Bug: #1641331

Logs & Configs
==============
In the output below, all the log lines marked with '***' come from code I
added in order to troubleshoot the issue.

[1] On source node (last successful revert migrate):

2016-11-22 17:35:11.875 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_available_resource
2016-11-22 17:35:11.936 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_usage_from_instance
2016-11-22 17:35:11.971 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_usage_from_migrations
2016-11-22 17:35:11.972 ERROR nova.pci.manager [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** clean_usage
2016-11-22 17:35:11.973 ERROR nova.pci.stats [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** Adding device 0000:85:11.7
2016-11-22 17:35:11.973 ERROR nova.pci.stats [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** Pool is now: [vendor:8086, product:10ed, numa:1, physnet:default, dev_type:type-VF]: ['0000:85:10.1', '0000:85:10.3', '0000:85:10.5', '0000:85:11.1', '0000:85:11.3', '0000:85:11.7']
2016-11-22 17:35:33.817 ERROR nova.compute.manager [req-fba89cf1-21bf-482f-9b1d-3d634f5b3d73 admin admin] *** finish_revert_resize
2016-11-22 17:35:35.903 INFO nova.compute.manager [req-7d27d388-0613-451b-a9da-34dde9d5d0f2 None None] [instance: e99dc7ad-3c1c-4cd0-96d1-00e6cbcb8725] VM Resumed (Lifecycle Event)
2016-11-22 17:35:35.999 INFO nova.compute.manager [req-7d27d388-0613-451b-a9da-34dde9d5d0f2 None None] [instance: e99dc7ad-3c1c-4cd0-96d1-00e6cbcb8725] VM Started (Lifecycle Event)

[2] On destination node (next migration):

2016-11-22 17:41:31.739 ERROR nova.compute.manager [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** finish_resize - uuid: 19d615e1-04d3-42ea-b30f-f6cfc25f0397
2016-11-22 17:41:31.740 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** _update_port_binding_for_instance
2016-11-22 17:41:31.793 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** port: {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2016-11-22T22:41:31Z', u'device_owner': u'compute:None', u'revision_number': 5011, u'port_security_enabled': True, u'binding:profile': {u'pci_slot': u'0000:85:11.7', u'physical_network': u'default', u'pci_vendor_info': u'8086:10ed'}, u'fixed_ips': [{u'subnet_id': u'd9c807e8-f5d3-4eb8-9d1e-dee0edc922b7', u'ip_address': u'10.0.0.6'}, {u'subnet_id': u'd1a618f1-217b-4a53-9bd4-c89cb8e55316', u'ip_address': u'2001:db8:8000:0:f816:3eff:fee7:b3d'}], u'id': u'd55c021a-d45c-4cfa-9607-11cf572b687e', u'security_groups': [u'fa65d63f-7466-47c0-aa56-bd827077fed9'], u'binding:vif_details': {u'port_filter': False, u'vlan': u'468'}, u'binding:vif_type': u'hw_veb', u'mac_address': u'fa:16:3e:e7:0b:3d', u'project_id': u'caa3cffc0402480ba611bd4ccb87492a', u'status': u'DOWN', u'binding:host_id': u'IronPass-3', u'description': u'', u'device_id': u'19d615e1-04d3-42ea-b30f-f6cfc25f0397', u'name': u'sriov_port_4', u'admin_state_up': True, u'network_id': u'3a4e3f83-7b84-4af8-be90-288f944f8fa7', u'tenant_id': u'caa3cffc0402480ba611bd4ccb87492a', u'created_at': u'2016-11-21T17:10:33Z', u'binding:vnic_type': u'direct'}
2016-11-22 17:41:31.793 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** _get_pci_mapping_for_migration
2016-11-22 17:41:31.794 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context: MigrationContext(created_at=<?>,deleted=<?>,deleted_at=<?>,instance_uuid=19d615e1-04d3-42ea-b30f-f6cfc25f0397,migration_id=2127,new_numa_topology=None,new_pci_devices=PciDeviceList,new_pci_requests=InstancePCIRequests,old_numa_topology=None,old_pci_devices=PciDeviceList,old_pci_requests=InstancePCIRequests,updated_at=<?>)
2016-11-22 17:41:31.794 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context.old_pci_devices.objects: []
2016-11-22 17:41:31.795 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context.new_pci_devices.objects: [PciDevice(address='0000:05:11.5',compute_node_id=1,created_at=2016-11-14T15:44:04Z,deleted=False,deleted_at=None,dev_id='pci_0000_05_11_5',dev_type='type-VF',extra_info={},id=7,instance_uuid='19d615e1-04d3-42ea-b30f-f6cfc25f0397',label='label_8086_10ed',numa_node=0,parent_addr='0000:05:00.1',product_id='10ed',request_id='06dd5913-5960-4fc9-ac1c-4f0bdfe344a7',status='claimed',updated_at=2016-11-22T22:40:17Z,vendor_id='8086')]
2016-11-22 17:41:31.795 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** pci_mapping: {}
2016-11-22 17:41:31.795 ERROR nova.compute.manager [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] Setting instance vm_state to ERROR
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] Traceback (most recent call last):
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/compute/manager.py", line 4003, in finish_resize
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     disk_info, image_meta)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/compute/manager.py", line 3935, in _finish_resize
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     migration_p)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/network/neutronv2/api.py", line 2033, in migrate_instance_finish
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     migration=migration)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/network/neutronv2/api.py", line 2424, in _update_port_binding_for_instance
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     pci_slot)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] PortUpdateFailed: Port update failed for port d55c021a-d45c-4cfa-9607-11cf572b687e: Unable to correlate PCI slot 0000:85:11.7
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]
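The empty pci_mapping above is what turns into the PortUpdateFailed: the
old PCI devices in the migration context are matched against the new
ones to translate the pci_slot in the port's binding:profile, and with
old_pci_devices empty there is nothing to match. A simplified sketch of
that correlation follows; it is an assumption about the logic behind
_get_pci_mapping_for_migration and _update_port_binding_for_instance,
not nova's literal code:

    # Simplified sketch of the pci_slot correlation; an assumption about
    # the logic, not nova's literal implementation.
    class PortUpdateFailed(Exception):
        pass

    def get_pci_mapping_for_migration(old_pci_devices, new_pci_devices):
        # Pair the device claimed on the new host with the device used
        # on the old host for the same PCI request, keyed by the old
        # device address.
        new_by_request = {d["request_id"]: d for d in new_pci_devices}
        return {old["address"]: new_by_request[old["request_id"]]
                for old in old_pci_devices
                if old["request_id"] in new_by_request}

    def update_port_binding(port, pci_mapping):
        pci_slot = port["binding:profile"]["pci_slot"]
        new_dev = pci_mapping.get(pci_slot)
        if new_dev is None:
            # With old_pci_devices == [], pci_mapping is {} and we land
            # here: the exact failure in the traceback above.
            raise PortUpdateFailed(
                "Port update failed for port %s: Unable to correlate "
                "PCI slot %s" % (port["id"], pci_slot))
        port["binding:profile"]["pci_slot"] = new_dev["address"]

    # With the state from the logs (old list empty, one new device):
    mapping = get_pci_mapping_for_migration(
        [], [{"request_id": "06dd5913-5960-4fc9-ac1c-4f0bdfe344a7",
              "address": "0000:05:11.5"}])
    assert mapping == {}  # hence "Unable to correlate PCI slot ..."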
** Affects: nova
   Importance: Undecided
   Status: New