** Also affects: devstack
   Importance: Undecided
       Status: New

** Description changed:
  Description
  ===========
- The initial request to destroy the instance
- (45adbb55-491d-418b-ba68-7db43d1c235b / instance-0000000d) that had been
- live migrated to the host comes into n-cpu here:
-
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt
-
- 9916 Oct 01 11:10:33.162056 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "45adbb55-491d-418b-ba68-7db43d1c235b" acquired by "nova.compute.manager.ComputeManager.terminate_instance.<locals>.do_terminate_instance" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
-
- In libvirtd we see the thread attempt and fail to terminate the process
- that appears busy:
-
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/libvirt/libvirt/libvirtd_log.txt
-
- 77557 2021-10-01 11:10:33.173+0000: 57258: debug : virThreadJobSet:93 : Thread 57258 (virNetServerHandleJob) is now running job remoteDispatchDomainDestroy
- 77558 2021-10-01 11:10:33.173+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x1
- 77559 2021-10-01 11:10:33.173+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- [..]
- 77673 2021-10-01 11:10:43.180+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77674 2021-10-01 11:11:13.202+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
- 77675 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=destroy agentJob=none asyncJob=none (vm=0x7f93400f98c0 name=instance-0000000d, current job=none agentJob=none async=none)
- 77676 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9470 : Started job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
- 77677 2021-10-01 11:11:13.203+0000: 57258: debug : qemuProcessStop:7279 : Shutting down vm=0x7f93400f98c0 name=instance-0000000d id=14 pid=73535, reason=destroyed, asyncJob=none, flags=0x0
- 77678 2021-10-01 11:11:13.203+0000: 57258: debug : qemuDomainLogAppendMessage:10691 : Append log message (vm='instance-0000000d' message='2021-10-01 11:11:13.203+0000: shutting down, reason=destroyed
- 77679 ) stdioLogD=1
- 77680 2021-10-01 11:11:13.204+0000: 57258: info : qemuMonitorClose:916 : QEMU_MONITOR_CLOSE: mon=0x7f93400ecce0 refs=4
- 77681 2021-10-01 11:11:13.205+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x5
- 77682 2021-10-01 11:11:13.205+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- 77683 2021-10-01 11:11:23.213+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
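For reference, the escalation virProcessKillPainfullyDelay is performing here (SIGTERM, a bounded wait, then SIGKILL, then giving up with an EBUSY-style error when the process still will not exit, which typically means it is stuck in uninterruptible I/O) roughly corresponds to the Python sketch below. The 10s/30s waits are inferred from the timestamps in the excerpt above, not taken from libvirt's source, and the sketch assumes sufficient privileges to signal the target pid.

    import errno
    import os
    import signal
    import time


    def kill_painfully(pid, term_wait=10.0, kill_wait=30.0, poll=0.5):
        """SIGTERM, wait, then SIGKILL, then give up, mirroring the log above."""

        def alive():
            try:
                os.kill(pid, 0)  # signal 0: existence check only
                return True
            except ProcessLookupError:
                return False

        os.kill(pid, signal.SIGTERM)
        deadline = time.monotonic() + term_wait
        while alive() and time.monotonic() < deadline:
            time.sleep(poll)
        if not alive():
            return

        # "Timed out waiting after SIGTERM to process ..., sending SIGKILL"
        os.kill(pid, signal.SIGKILL)
        deadline = time.monotonic() + kill_wait
        while alive() and time.monotonic() < deadline:
            time.sleep(poll)
        if alive():
            # "Failed to terminate process ... with SIGKILL: Device or resource busy"
            raise OSError(errno.EBUSY, os.strerror(errno.EBUSY), str(pid))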
- 77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy - - Towards the end we get an idea why this could be happening with the - following virNetDevTapDelete failure: - - 77689 2021-10-01 11:11:53.241+0000: 57258: error : virNetDevTapDelete:339 : Unable to associate TAP device: Device or resource busy - 77690 2021-10-01 11:11:53.241+0000: 57258: debug : virSystemdTerminateMachine:488 : Attempting to terminate machine via systemd - - libvirtd then attempts to kill the domain via systemd and this - eventually completes but in the meantime the attached volume has - disappeared, this might be due to Tempest cleaning the volume up in the - background: - - 77955 2021-10-01 11:11:54.266+0000: 57258: debug : virCgroupV1Remove:699 : Done removing cgroup /machine.slice/machine-qemu\x2d14\x2dinstance\x2d0000000d.scope - 77956 2021-10-01 11:11:54.266+0000: 57258: warning : qemuProcessStop:7488 : Failed to remove cgroup for instance-0000000d - 77957 2021-10-01 11:12:30.391+0000: 57258: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address - 77958 2021-10-01 11:12:30.391+0000: 57258: warning : qemuBlockRemoveImageMetadata:2629 : Unable to remove disk metadata on vm instance-0000000d from /dev/sdb (disk target vda) - 77959 2021-10-01 11:12:30.392+0000: 57258: debug : virObjectEventNew:631 : obj=0x7f934c087670 - 77960 2021-10-01 11:12:30.392+0000: 57258: debug : qemuDomainObjEndJob:9746 : Stopping job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d) - [.] - 77962 2021-10-01 11:12:30.393+0000: 57258: debug : virThreadJobClear:118 : Thread 57258 (virNetServerHandleJob) finished job remoteDispatchDomainDestroy with ret=0 - - Back in n-cpu we see the vif unplug correctly: - - https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova- - live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt - - 9990 Oct 01 11:12:30.475937 ubuntu-focal-iweb-mtl01-0026751352 nova- - compute[67743]: INFO os_vif [None - req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest- - LiveMigrationTest-22574426 tempest-LiveMigrationTest-225744 - 26-project] Successfully unplugged vif - VIFOpenVSwitch(active=True,address=fa:16:3e:68:e0:63,bridge_name='br- - int',has_traffic_filtering=True,id=ea034ac1-f87f-4319-8c17-a9748af7990d,network=Network(cff81725- - 6326-4be1-97f5-116b0b60d276),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapea034ac1-f8') - - - We then encounter issues disconnecting the attached volume from the instance: - - 9993 Oct 01 11:12:30.479947 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "connect_volume" acquired by "os_brick.initiator.connectors.iscsi.ISCSIConnector.disconnect_volume" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist- packages/oslo_concurrency/lockutils.py:355}} - [..] 
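For orientation, the os-brick call that takes that "connect_volume" lock boils down to something like the sketch below. The connection_properties values are illustrative placeholders (in Nova they come from Cinder's initialize_connection for the volume), and the root_helper is deployment specific.

    from os_brick.initiator import connector

    # Illustrative placeholders; the real values come from Cinder's
    # initialize_connection for volume ba95db3d-c2ff-40c6-ba7c-563c3070c85d.
    connection_properties = {
        'target_portal': '192.0.2.10:3260',
        'target_iqn': 'iqn.2010-10.org.openstack:volume-ba95db3d',
        'target_lun': 1,
    }

    iscsi = connector.InitiatorConnector.factory(
        'ISCSI',
        root_helper='sudo',   # DevStack-style helper; deployments differ
        use_multipath=False,
    )

    # This is what runs under the "connect_volume" lock seen at 11:12:30 above.
    # If the backing device (/dev/sdb here) has already gone away, the flush
    # and logout steps can fail, and Nova then logs
    # "Ignoring Volume Error ... during delete".
    iscsi.disconnect_volume(connection_properties, device_info=None)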
-
- Finally there's another ~60 second gap until the request
- completes:
-
- 10135 Oct 01 11:14:33.313929 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO nova.compute.manager [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Took 240.14 seconds to destroy the instance on the hypervisor.
-
+ See c#2 for the actual issue here.

  Steps to reproduce
  ==================
  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.

  Expected result
  ===============
  Both the test and cleanup pass without impacting libvirtd.

  Actual result
  =============
  The test passes; cleanup locks up the single thread handling the
  libvirtd event loop in 6.0.0.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
- list for all releases: http://docs.openstack.org/releases/
+ list for all releases: http://docs.openstack.org/releases/
- stable/xena and master
+ stable/xena and master

  2. Which hypervisor did you use?
- (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
- What's the version of that?
+ (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
+ What's the version of that?
- libvirt (6.0.0) and QEMU
+ libvirt (6.0.0) and QEMU

  2. Which storage type did you use?
- (For example: Ceph, LVM, GPFS, ...)
- What's the version of that?
+ (For example: Ceph, LVM, GPFS, ...)
+ What's the version of that?
- N/A
+ N/A

  3. Which networking type did you use?
- (For example: nova-network, Neutron with OpenVSwitch, ...)
+ (For example: nova-network, Neutron with OpenVSwitch, ...)
- Trunk ports.
+ Trunk ports.

  Logs & Configs
  ==============
  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310 where the locking up
  within libvirtd causes other tests to then fail.
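Since the symptom above is the single libvirtd event-loop thread locking up, a quick way to check for the hang while reproducing is a trivial read-only probe from a fresh connection, run under an external timeout so the probe reports the hang rather than joining it. A minimal libvirt-python sketch follows; the URI is the usual system one and the timeout value is an assumption for a DevStack-style host.

    import libvirt  # python3-libvirt bindings

    # Run under an external timeout, e.g. `timeout 10 python3 probe.py`;
    # if libvirtd's dispatch thread is wedged, the call below hangs instead
    # of returning, and the expiring timeout is the signal that it is stuck.
    conn = libvirt.open('qemu:///system')
    print([dom.name() for dom in conn.listAllDomains()])
    conn.close()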
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1945983

Title:
  Duplicate iSCSI initiators causing live migration failures

Status in devstack:
  New
Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  Description
  ===========
  See c#2 for the actual issue here.

  Steps to reproduce
  ==================
  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.

  Expected result
  ===============
  Both the test and cleanup pass without impacting libvirtd.

  Actual result
  =============
  The test passes; cleanup locks up the single thread handling the
  libvirtd event loop in 6.0.0.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

  stable/xena and master

  2. Which hypervisor did you use?
  (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
  What's the version of that?

  libvirt (6.0.0) and QEMU

  2. Which storage type did you use?
  (For example: Ceph, LVM, GPFS, ...)
  What's the version of that?

  N/A

  3. Which networking type did you use?
  (For example: nova-network, Neutron with OpenVSwitch, ...)

  Trunk ports.

  Logs & Configs
  ==============
  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310 where the locking up
  within libvirtd causes other tests to then fail.

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1945983/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp