[Yahoo-eng-team] [Bug 2076228] [NEW] nova-scheduler fails to acquire lock on hosts on live migration

2024-08-07 Thread Radu Malica
Public bug reported:

Description
===========

I am running OpenStack Antelope, deployed with Juju charms, on Ceph-
backed storage. Antelope was upgraded from Zed, which was originally
deployed following the official charmed OpenStack guide; the upgrade
followed the official upgrade guide.

All hosts run identical hardware: Dell PowerEdge R610 servers with 24
cores and 48 GB of RAM.

I tried live migration with volume-backed VMs and with image-backed VMs
(also with --block-migration). All hosts share /var/lib/nova/instances
via NFS for local storage.

The VMs to be live-migrated have no extra configuration properties
tying them to AZs or similar; they are plain VMs created from the
Horizon dashboard.

Steps to reproduce
==================

Upgrade from Zed to Antelope, then try to live-migrate VMs.


Logs & Configs
==============

Environment uses libvirt/KVM with neutron-api and OVN SDN.

Nova version 27.1.0
ii  nova-api-os-compute  3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute - OpenStack Compute API frontend
ii  nova-common          3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute - common files
ii  nova-conductor       3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute - conductor service
ii  nova-scheduler       3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute - virtual machine scheduler
ii  nova-spiceproxy      3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute - spice html5 proxy
ii  python3-nova         3:27.1.0-0ubuntu1.2~cloud0  all  OpenStack Compute Python 3 libraries
ii  python3-novaclient   2:18.3.0-0ubuntu1~cloud0    all  client library for OpenStack Compute API - 3.x


Filters enabled: 
AvailabilityZoneFilter,ComputeFilter,ImagePropertiesFilter,DifferentHostFilter,SameHostFilter
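
For reference, that list corresponds to the scheduler filter setting in
nova.conf; a sketch of the relevant section follows (the option name is
per the upstream Antelope docs, not copied from this deployment):

---

[filter_scheduler]
enabled_filters = AvailabilityZoneFilter,ComputeFilter,ImagePropertiesFilter,DifferentHostFilter,SameHostFilter

---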

Charm configs are defaults with no changes. Live migration worked in
Zed but fails now; after debugging on the nova-cloud-controller, the
following shows up:

FULL LOG:  https://pastebin.com/NvMazzkC

In short, nova-scheduler iterates through the hosts, and the following
happens immediately for each available host until the list is
exhausted:

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None
req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb
91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - -
da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Lock
"('os-host-10.maas', 'os-host-10.maas')" "released" by
"nova.scheduler.host_manager.HostState.update.._locked_update"
:: held 0.003s inner /usr/lib/python3/dist-
packages/oslo_concurrency/lockutils.py:423

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None 
req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb 91e3c47f7f6a42f1946f9b96d6e07be7 
8ce43a2a472e424e8419635cd279b222 - - 
da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Acquiring 
lock "('os-host-11.maas', 'os-host-11.maas')" by 
"nova.scheduler.host_manager.HostState.update.._locked_update" inner 
/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:404

2024-08-07 10:15:36.663 1307737 DEBUG oslo_concurrency.lockutils [None
req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb
91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - -
da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7] Lock
"('os-host-11.maas', 'os-host-11.maas')" acquired by
"nova.scheduler.host_manager.HostState.update.._locked_update"
:: waited 0.000s inner /usr/lib/python3/dist-
packages/oslo_concurrency/lockutils.py:409

2024-08-07 10:15:36.664 1307737 DEBUG nova.scheduler.host_manager [None
req-2aa2922e-66b3-4543-81d5-ce8d92fb0eeb
91e3c47f7f6a42f1946f9b96d6e07be7 8ce43a2a472e424e8419635cd279b222 - -
da112566f0a44d0c898dde46aee63dd7 da112566f0a44d0c898dde46aee63dd7]
Update host state from compute node (all properties pulled from that
compute node omitted here)

Update host state with aggregates:
[Aggregate(created_at=2023-11-01T17:48:42Z,deleted=False,deleted_at=None,hosts=['os-host-4-shelf.maas','os-host-1.maas','os-host-2.maas','os-host-9.maas','os-host-11.maas','os-host-10.maas','os-host-6.maas','os-host-8.maas','os-host-7.maas','os-host-5.maas','os-host-3.maas'],id=1,metadata={availability_zone='nova'},name='nova_az',updated_at=None,uuid=9e0b10a6-8030-4bbf-92a7-724d4cb3a0d0)]
_locked_update /usr/lib/python3/dist-packages/nova/scheduler/host_manager.py:172

 Update host state with service dict: {'id': 52, 'uuid':
'c6778fc7-5575-4859-b6ad-cdca697cebac', 'host': 'os-host-11.maas',
'binary': 'nova-compute', 'topic': 'compute', 'report_count': 14216,
'disabled': False, 'disabled_reason': None, 'last_seen_up':
datetime.datetime(2024, 8, 7, 10, 15, 36, tzinfo=datetime.timezone.utc),
'forced_down': False, 'version': 66, 'created_at':
datetime.datetime(2024, 8, 5, 18, 44, 9, tzinfo=dateti
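
For context, the lock messages above come from the synchronized inner
update function in nova's host manager, where the lock name is the
(host, nodename) tuple seen in the DEBUG lines. A simplified sketch of
that pattern, using oslo_concurrency directly (a paraphrase, not nova's
exact code; nova wraps lockutils in its own utils.synchronized):

    from oslo_concurrency import lockutils

    class HostState:
        def __init__(self, host, nodename):
            self.host = host
            self.nodename = nodename

        def update(self, compute=None, service=None, aggregates=None):
            # The lock name is the (host, nodename) tuple, which is why
            # the log shows e.g. "('os-host-11.maas', 'os-host-11.maas')".
            @lockutils.synchronized((self.host, self.nodename))
            def _locked_update(self, compute, service, aggregates):
                # Refresh the cached compute-node, service and aggregate
                # state for this host while holding the lock.
                pass

            return _locked_update(self, compute, service, aggregates)

Each acquire/release pair in the log corresponds to one such per-host
update.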

[Yahoo-eng-team] [Bug 2076437] [NEW] glance doesn't convert images to RAW if setting is enabled

2024-08-09 Thread Radu Malica
Public bug reported:

DESCRIPTION
===========

Glance 26.0.0, OpenStack Antelope, deployed with Juju charms.

Settings in glance-api.conf:

---

[image_import_opts]
image_import_plugins = ['image_conversion']

[image_conversion]
output_format = raw


---


I am syncing images using simplestreams-sync (latest/edge). SS pulls
the IMG filenames for Ubuntu 24.04 amd64, resulting in the download of
the respective image, which is in QCOW2 format.

If SS is configured with enable-image-conversion (which requires Glance
to have that option enabled, and it does), then SS assumes the
downloaded image is RAW and pushes it to Glance as such:


glance create_kwargs {'name': 
'auto-sync/ubuntu-noble-24.04-amd64-server-20240809-disk1.img', 
'container_format': 'bare', 'visibility': 'public', 'disk_format': 'raw'}
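
As an illustration only, a hypothetical paraphrase of what SS
effectively does via python-glanceclient ('sess' stands in for an
authenticated keystoneauth1 session and is an assumption):

    from glanceclient import Client

    glance = Client('2', session=sess)  # assumed: keystoneauth1 session
    image = glance.images.create(
        name='auto-sync/ubuntu-noble-24.04-amd64-server-20240809-disk1.img',
        container_format='bare',
        visibility='public',
        disk_format='raw')  # declared RAW although the file is QCOW2

That declared disk_format is what later feeds the conversion plugin's
source_format.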


Glance should check, via its inspectors in image_conversion.py, whether
the uploaded image really is RAW; it seems the check fails:

---
Source is already in target format, not doing conversion for 
6a79cbd5-cd19-4aea-a155-f982ed75ac62 _execute 
/usr/lib/python3/dist-packages/glance/async_/flows/plugins/image_conversion.py:181
---

What happens in that file: the check runs "qemu-img info -f
source_format --output=json /uploadedfilebyss", which in our case
translates to "qemu-img info -f raw ...".

That always reports RAW, because the source_format provided by the
image-create command from SS is RAW, hence the message "Source is
already in target format".
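
A minimal sketch of the problematic probe (a paraphrase, not the
plugin's exact code; the real plugin shells out through oslo
processutils):

    import json
    import subprocess

    def reported_format(path, source_format):
        # With -f, qemu-img skips format probing and trusts the caller,
        # so a QCOW2 file declared as raw is still reported as raw.
        out = subprocess.check_output(
            ['qemu-img', 'info', '-f', source_format,
             '--output=json', path])
        return json.loads(out)['format']

    # reported_format(img, 'raw') returns 'raw' even for a QCOW2 file,
    # so the plugin concludes the source is already in the target format
    # and skips the conversion.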

The image then shows up in Images as RAW, but it is 568M; if I download
it from Glance, it is actually QCOW2 and not RAW.

Example below:

ubuntu-24.04-server-cloudimg-amd64.img => downloaded from official
cloud-images

qemu-img info --output=json ubuntu-24.04-server-cloudimg-amd64.img

{
"virtual-size": 3758096384,
"filename": "ubuntu-24.04-server-cloudimg-amd64.img",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 585830400,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"compression-type": "zlib",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false,
"extended-l2": false
}
},
"dirty-flag": false
}


Running the same command the image_conversion plugin runs:

qemu-img info -f raw --output=json ubuntu-24.04-server-cloudimg-amd64.img

{
"virtual-size": 585826304,
"filename": "ubuntu-24.04-server-cloudimg-amd64.img",
"format": "raw",
"actual-size": 585830400,
"dirty-flag": false
}


I manually removed the -f source_format from image_conversion.py so the
command actually returns the detected format, and I get this error:

"Failed to execute task a805c37c-ea60-4b1b-9a75-9dee11e4e5cc: Image
metadata disagrees about format: RuntimeError: Image metadata disagrees
about format"

Presumably this is because the detected format (QCOW2) no longer
matches the RAW disk_format supplied at image-create time.

I reverted my change and disabled image_conversion in simplestreams,
and now SS uploads the image to Glance as QCOW2, as it normally is.

But now no image conversion happens at the Glance level even though it
is configured; the system skips the step altogether and proceeds to
upload the image to RBD as normal. (Note that image_import_plugins only
run in the interoperable image import flow, so if SS uses the plain
image-upload path, conversion would be bypassed entirely.)


=== cut here ===


2024-08-09 15:09:41.472 543146 DEBUG glance.location [None 
req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 
021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 
021a9c699be4425c889335019f91910e] Enabling in-flight format inspection for 
qcow2 set_data /usr/lib/python3/dist-packages/glance/location.py:581
2024-08-09 15:09:41.473 543146 DEBUG glance_store.multi_backend [None 
req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 
021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 
021a9c699be4425c889335019f91910e] Attempting to import store rbd 
_load_multi_store 
/usr/lib/python3/dist-packages/glance_store/multi_backend.py:170
2024-08-09 15:09:41.495 543146 DEBUG glance_store.capabilities [None 
req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 
021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 
021a9c699be4425c889335019f91910e] Store glance_store._drivers.rbd.Store doesn't 
support updating dynamic storage capabilities. Please overwrite 
'update_capabilities' method of the store to implement updating logics if 
needed. update_capabilities 
/usr/lib/python3/dist-packages/glance_store/capabilities.py:91
2024-08-09 15:09:41.496 543146 DEBUG glance_store.driver [None 
req-443b8021-070f-462c-acd3-1831cc85f080 e8ed0a8415ef498ebcece070bef55de2 
021dc8ae82ae4a5f9242eeec1ac86005 - - 021a9c699be4425c889335019f91910e 
021a9c699be4425c889335019f91910e] Late loading location class 
glance_store._drivers.rbd.StoreLocation get_store_location_class 
/usr/lib/python3/dist-packages/glance_store/driver.py:116
2024-08-09 15:09:41.496 543146 DEBUG glance_store.location [None 
req-443b80