Public bug reported:

Description
===========
OpenStack conductor threads hang while building an instance. The instance gets stuck in the building state with its placement allocation recorded in the DB, but the conductor fails to insert a record into the nova_cell1 instances table. Once the issue occurs, we must delete and recreate the stack.

Error in Conductor
Traceback (most recent call last):
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
    timer()
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
greenlet.error: cannot switch to a different thread

Steps to reproduce
==================
Shut down RabbitMQ for 5 minutes to emulate a failover scenario that forces the conductor threads to reconnect after the failure. Once RabbitMQ is back online, wait a minute, then spin up the stack across multiple hypervisors.

Expected result
===============
All VMs are up and running, and all volumes are attached to the VMs.

Actual result
=============
Randomly, on different compute nodes, a VM gets stuck in the build/scheduled state with the compute logs showing the errors below.

Compute Logs

2025-05-22 17:14:55.703 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:15:55.710 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:16:56.608 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:17:58.701 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:19:00.692 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:20:01.661 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.

During the failure, the conductor hit the greenlet thread error below and failed to insert the entry in the Nova cell database.

2025-05-22 18:41:53.101 7 INFO oslo.messaging._drivers.impl_rabbit [-] [4924e790-3518-4fae-8856-1c4336d4ee72] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37784.
2025-05-22 18:41:53.613 12 INFO oslo.messaging._drivers.impl_rabbit [-] [19734c27-f068-4514-8ddb-78e0dbaeb0db] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37792.
2025-05-22 18:41:53.889 12 INFO oslo.messaging._drivers.impl_rabbit [-] [2bae4372-d3a9-4acd-b575-00a27c8ca11a] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37798.
2025-05-22 18:41:54.525 8 INFO oslo.messaging._drivers.impl_rabbit [-] [7073f50d-61f7-4d51-ace6-5d5f1f0d0ab7] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37800.

Traceback (most recent call last):
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
    timer()
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
greenlet.error: cannot switch to a different thread
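The "cannot switch to a different thread" error above is the generic failure greenlet raises when code running in one OS thread tries to switch to a greenlet that belongs to another OS thread. As a minimal sketch, independent of Nova and oslo.messaging, the same error class can be reproduced like this:

    import threading

    import greenlet


    def noop():
        pass


    # A greenlet is bound to the OS thread that created it (here, the main thread).
    main_thread_greenlet = greenlet.greenlet(noop)


    def switch_from_other_thread():
        try:
            # Switching to it from a different native thread is not allowed.
            main_thread_greenlet.switch()
        except greenlet.error as exc:
            print("greenlet.error:", exc)  # "cannot switch to a different thread"


    worker = threading.Thread(target=switch_from_other_thread)
    worker.start()
    worker.join()

In the conductor case the switch is triggered indirectly, from inside eventlet's semaphore/timer machinery during the RabbitMQ reconnect, rather than by an explicit switch() call as in this sketch.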
Environment
===========
1. Exact version of OpenStack you are running:
   OpenStack Caracal
   nova@nova-conductor-7c8949bfd-5pmfr:/$ nova-conductor --version
   29.2.1

2. Which hypervisor did you use? What's the version of that?
   Libvirt + KVM
   Libvirt: 8.0.0
   Kernel: 5.15.0-136-generic

3. Which storage type did you use? What's the version of that?
   Ceph Squid

4. Which networking type did you use?
   Neutron with Open vSwitch in DPDK mode, along with the SR-IOV agent

What we have found so far:

OpenStack uses Eventlet for green threads (greenlets). However, Eventlet is barely maintained and was designed in the Python 2.x era, so recent OpenStack releases rely on Eventlet's monkey patching of the Python 3.x standard library, and this is leading to many threading issues. Some of the fixes (listed below) meant to prevent this are already in Antelope, but we still hit the issue.

https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17
https://opendev.org/openstack/oslo.log/commit/de615d9370681a2834cebe88acfa81b919da340c
https://review.opendev.org/c/openstack/oslo.log/+/914190

A significant effort is under way to move OpenStack off the Eventlet dependency and onto asyncio, but that effort will take at least four more releases, so a fix along those lines has not been released yet. Developers can follow these issues to track the changes:

https://github.com/eventlet/eventlet/issues/432
https://github.com/eventlet/eventlet/issues/662

We also note that https://github.com/eventlet/eventlet/issues/662 explicitly states that Eventlet 0.29.0 did not have this issue. We still need to verify that statement; rolling back to that Eventlet version in our next release would let us confirm it, but we cannot downgrade Eventlet due to dependencies, and it is not safe to do. We also changed the heartbeat_in_pthread setting for the Nova API in 2.12, and we need to evaluate whether we still require that setting.

As a data point, we hit the same issue with Antelope. A workaround is simply to retry the stack, but since this is causing many problems in our failover-readiness testing, we would like community help to fix it.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: conductor
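For illustration of the monkey patching mentioned under "What we have found so far": Nova services apply Eventlet's monkey patching at startup so that blocking stdlib calls (sockets, locks, sleeps) cooperate with the eventlet hub, and code that still runs in real native threads can then end up touching greenlets owned by a different thread, which is the error class in the tracebacks above. A rough sketch of the patching step (not the actual Nova startup code):

    import eventlet

    # Patch the stdlib (socket, threading, time, select, ...) so blocking calls
    # yield to the eventlet hub instead of blocking the whole process.
    eventlet.monkey_patch()

    import time


    def task(name):
        # With the patched time.sleep(), this yields to other green threads.
        time.sleep(0.1)
        print(name, "done")


    pool = eventlet.GreenPool()
    for i in range(3):
        pool.spawn(task, "task-%d" % i)
    pool.waitall()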