[Yahoo-eng-team] [Bug 2076228] [NEW] nova-scheduler fails to acquire lock on hosts on live migration
Public bug reported:

Description
===========
I am running OpenStack Antelope, deployed via Juju charms (Charmed OpenStack) on Ceph-backed storage. Antelope was upgraded from Zed, which had originally been deployed following the official Charmed OpenStack upgrade guide. All hosts run identical hardware: Dell PowerEdge R610 with 24 cores and 48 GB of RAM.

Live migration was tried with volume-backed VMs and with image-backed VMs (including --block-migration). All hosts share /var/lib/nova/instances via NFS for local storage. The VMs to be live-migrated have no extra configuration properties tying them to AZs or similar; they are plain VMs created from the Horizon dashboard.

Steps to reproduce
==================
Upgrade from Zed to Antelope, then try to live-migrate VMs.

Logs & Configs
==============
The environment uses libvirt/KVM with neutron-api and the OVN SDN. Nova version is 27.1.0:

ii nova-api-os-compute 3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute - OpenStack Compute API frontend
ii nova-common         3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute - common files
ii nova-conductor      3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute - conductor service
ii nova-scheduler      3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute - virtual machine scheduler
ii nova-spiceproxy     3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute - spice html5 proxy
ii python3-nova        3:27.1.0-0ubuntu1.2~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient  2:18.3.0-0ubuntu1~cloud0   all client library for OpenStack Compute API - 3.x

Filters enabled: AvailabilityZoneFilter,ComputeFilter,ImagePropertiesFilter,DifferentHostFilter,SameHostFilter

Charm configs are defaults with no changes. Live migration worked in Zed. After debugging on the nova-cloud-controller:

FULL LOG: https://pastebin.com/NvMazzkC

In short, nova-scheduler iterates through the hosts, and this happens immediately for each available host until the list is exhausted:

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb 91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - - da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Lock "('os-host-10.maas', 'os-host-10.maas')" "released" by "nova.scheduler.host_manager.HostState.update.._locked_update" :: held 0.003s inner /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:423

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb 91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - - da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Acquiring lock "('os-host-11.maas', 'os-host-11.maas')" by "nova.scheduler.host_manager.HostState.update.._locked_update" inner /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:404

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb 91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - - da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Lock "('os-host-11.maas', 'os-host-11.maas')" acquired by "nova.scheduler.host_manager.HostState.update.._locked_update" :: waited 0.000s inner /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:409

2024-08-07 10:15:36.664 1307737 DEBUG nova.scheduler.host_manager [None req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb 91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - - da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Update host state from compute node (all properties here pulled from that compute node)

Update host state with aggregates: [Aggregate(created_at=2023-11-01T17:48:42Z,deleted=False,deleted_at=None,hosts=['os-host-4-shelf.maas','os-host-1.maas','os-host-2.maas','os-host-9.maas','os-host-11.maas','os-host-10.maas','os-host-6.maas','os-host-8.maas','os-host-7.maas','os-host-5.maas','os-host-3.maas'],id=1,metadata={availability_zone='nova'},name='nova_az',updated_at=None,uuid=9e0b10a6-8030-4bbf-92a7-724d4cb3a0d0)] _locked_update /usr/lib/python3/dist-packages/nova/scheduler/host_manager.py:172

Update host state with service dict: {'id': 52, 'uuid': 'c6778fc7-5575-4859-b6ad-cdca697cebac', 'host': 'os-host-11.maas', 'binary': 'nova-compute', 'topic': 'compute', 'report_count': 14216, 'disabled': False, 'disabled_reason': None, 'last_seen_up': datetime.datetime(2024, 8, 7, 10, 15, 36, tzinfo=datetime.timezone.utc), 'forced_down': False, 'version': 66, 'created_at': datetime.datetime(2024, 8, 5, 18, 44, 9, tzinfo=dateti
[Yahoo-eng-team] [Bug 2076437] [NEW] glance doesn't convert images to RAW if setting is enabled
Public bug reported:

DESCRIPTION
===========
Glance 26.0.0, OpenStack Antelope deployed with Juju charms.

Settings in glance-api.conf:

[image_import_opts]
image_import_plugins = ['image_conversion']

[image_conversion]
output_format = raw

I am syncing images using simplestreams-sync (latest/edge). SS resolves the IMG filenames for Ubuntu 24.04 amd64, which results in downloading the corresponding image in QCOW2 format. If SS is configured with enable-image-conversion (Glance must also have that option enabled, which it has), then SS assumes the downloaded image is RAW and pushes it to Glance as such:

glance create_kwargs {'name': 'auto-sync/ubuntu-noble-24.04-amd64-server-20240809-disk1.img', 'container_format': 'bare', 'visibility': 'public', 'disk_format': 'raw'}

Glance should use its inspectors in image_conversion.py to check whether the uploaded image really is RAW, but the check appears to fail:

Source is already in target format, not doing conversion for 6a79cbd5-cd19-4aea-a155-f982ed75ac62 _execute /usr/lib/python3/dist-packages/glance/async_/flows/plugins/image_conversion.py:181

What happens in that file: the check runs "qemu-img info -f source_format --format=json /uploadedfilebyss", which in our case translates to "qemu-img info -f raw" on the uploaded file. That always reports RAW, because the source_format supplied by the image-create call from SS is RAW, hence the message "Source is already in target format". The image then shows up in Images as RAW at 568M, but if I download it from Glance it is actually QCOW2, not RAW.
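The failure mode can be reproduced without qemu-img: forcing a source format skips inspection entirely. A minimal sketch (hypothetical helper, not glance's code), assuming only the standard 4-byte QCOW2 magic b'QFI\xfb':

```python
# Hypothetical inspector illustrating the bug: honouring a forced
# format (the way "qemu-img info -f <fmt>" does) means the file's
# magic bytes are never consulted.
QCOW2_MAGIC = b'QFI\xfb'  # every QCOW2 file starts with these 4 bytes

def detect_format(data, forced=None):
    if forced is not None:
        return forced              # detection short-circuited by "-f"
    if data[:4] == QCOW2_MAGIC:
        return 'qcow2'
    return 'raw'                   # raw has no magic; it is the fallback

header = QCOW2_MAGIC + b'\x00\x00\x00\x03'  # start of a QCOW2 v3 header
detect_format(header)           # 'qcow2' -- real inspection
detect_format(header, 'raw')    # 'raw'   -- source_format wins, the bug
```

Because SS declares disk_format=raw, the forced branch is always taken, and the QCOW2 payload sails through unconverted.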
Example below. ubuntu-24.04-server-cloudimg-amd64.img was downloaded from the official cloud-images site:

qemu-img info --output=json ubuntu-24.04-server-cloudimg-amd64.img
{
    "virtual-size": 3758096384,
    "filename": "ubuntu-24.04-server-cloudimg-amd64.img",
    "cluster-size": 65536,
    "format": "qcow2",
    "actual-size": 585830400,
    "format-specific": {
        "type": "qcow2",
        "data": {
            "compat": "1.1",
            "compression-type": "zlib",
            "lazy-refcounts": false,
            "refcount-bits": 16,
            "corrupt": false,
            "extended-l2": false
        }
    },
    "dirty-flag": false
}

Running the same command the image_conversion plugin runs:

qemu-img info -f raw --output=json ubuntu-24.04-server-cloudimg-amd64.img
{
    "virtual-size": 585826304,
    "filename": "ubuntu-24.04-server-cloudimg-amd64.img",
    "format": "raw",
    "actual-size": 585830400,
    "dirty-flag": false
}

I manually removed the -f source_format from image_conversion.py so the command returns the correct format, and then I get this error:

"Failed to execute task a805c37c-ea60-4b1b-9a75-9dee11e4e5cc: Image metadata disagrees about format: RuntimeError: Image metadata disagrees about format"

I reverted my change and disabled image_conversion in simplestreams, and now SS uploads the image to Glance as QCOW2, as it normally is. But now no image conversion happens at the Glance level even though it is configured; the system skips the step altogether and proceeds to upload the image to RBD as normal.
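Feeding the two qemu-img JSON outputs above into the conversion decision shows why the plugin bails out. A sketch of that decision (illustrative, not the plugin's actual _execute code):

```python
import json

def needs_conversion(qemu_img_info_json, output_format='raw'):
    """True if the inspected format differs from the target format.
    When it is False the plugin logs 'Source is already in target
    format' and skips conversion."""
    info = json.loads(qemu_img_info_json)
    return info['format'] != output_format

# Abbreviated versions of the two outputs shown above.
honest = '{"format": "qcow2", "virtual-size": 3758096384}'
forced = '{"format": "raw", "virtual-size": 585826304}'

needs_conversion(honest)  # True  -- conversion would run
needs_conversion(forced)  # False -- the skip seen in the log
```

With "-f raw" forcing the "format" field to "raw", the comparison can never detect the QCOW2 payload, which matches the behaviour described above.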
cut here
========
2024-08-09 15:09:41.472 543146 DEBUG glance.location [None req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 021a9c699be4425c889335019f91910e] Enabling in-flight format inspection for qcow2 set_data /usr/lib/python3/dist-packages/glance/location.py:581

2024-08-09 15:09:41.473 543146 DEBUG glance_store.multi_backend [None req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 021a9c699be4425c889335019f91910e] Attempting to import store rbd _load_multi_store /usr/lib/python3/dist-packages/glance_store/multi_backend.py:170

2024-08-09 15:09:41.495 543146 DEBUG glance_store.capabilities [None req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 021a9c699be4425c889335019f91910e] Store glance_store._drivers.rbd.Store doesn't support updating dynamic storage capabilities. Please overwrite 'update_capabilities' method of the store to implement updating logics if needed. update_capabilities /usr/lib/python3/dist-packages/glance_store/capabilities.py:91

2024-08-09 15:09:41.496 543146 DEBUG glance_store.driver [None req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 021a9c699be4425c889335019f91910e] Late loading location class glance_store._drivers.rbd.StoreLocation get_store_location_class /usr/lib/python3/dist-packages/glance_store/driver.py:116

2024-08-09 15:09:41.496 543146 DEBUG glance_store.location [None req-443b80