Public bug reported:

Description
===========

We have an OpenStack deployment with a RabbitMQ cluster of 3 nodes and dozens of nova-compute nodes. When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that nova-compute.service was down on 2 of the nova-compute nodes.
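For context on why the monitoring and systemd can disagree here (a guess on my part at what the Nagios check inspects): Nova marks a compute service "down" in its own service records when the service's periodic RPC heartbeats through RabbitMQ stop arriving, regardless of whether the process is still alive. A minimal sketch of checking that control-plane view, assuming openstacksdk and a clouds.yaml profile named "mycloud" (both hypothetical for this deployment):

import openstack

# Connect using a (hypothetical) clouds.yaml profile.
conn = openstack.connect(cloud="mycloud")

# state is "up"/"down" based on RPC heartbeats over RabbitMQ;
# status is the admin enable/disable flag. A running process whose
# heartbeats are stuck will still show state=down here.
for svc in conn.compute.services():
    print(f"{svc.host:<20} {svc.binary:<15} state={svc.state} status={svc.status}")

The CLI equivalent is "openstack compute service list".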
Upon checking, we found that nova-compute.service itself is still running:

nova-compute.service - OpenStack Compute
     Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
   Main PID: 10130 (nova-compute)
      Tasks: 32 (limit: 463517)
     Memory: 248.2M
        CPU: 55min 5.217s
     CGroup: /system.slice/nova-compute.service
             ├─10130 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
             ├─11527 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
             └─11702 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock

Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64060)
Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed for user root
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

My guess is that when a RabbitMQ node is shut down, nova-compute hits contention or internal state inconsistencies while handling connection recovery; restarting nova-compute.service resolves the problem.
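The traceback is consistent with a cross-thread greenlet switch: greenlets are bound to the OS thread that created them, and the eventlet hub's timer is calling waiter.switch() on a waiter that apparently lives in a different native thread. A minimal standalone sketch (not from this bug report) that reproduces the same greenlet.error:

import threading

import greenlet

# Greenlets belong to the OS thread that created them; this main
# greenlet is owned by the main thread.
main_gl = greenlet.getcurrent()

def switch_from_other_thread():
    try:
        # Switching to a greenlet owned by another thread is illegal.
        main_gl.switch()
    except greenlet.error as exc:
        print("greenlet.error:", exc)  # "cannot switch to a different thread"

t = threading.Thread(target=switch_from_other_thread)
t.start()
t.join()

One known trigger for this pattern is oslo.messaging's AMQP heartbeat running in a native thread (the [oslo_messaging_rabbit] heartbeat_in_pthread option) while the rest of the service runs under eventlet; whether that is what happens here during connection recovery is speculation on my part.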
Logs & Configs
==============

The nova-compute.log:

2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 10.10.10.52:5672 via [amqp] client with port 35346.
2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

Then systemctl status nova-compute shows:

Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

Environment: Ubuntu Jammy + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)

nova.conf:

[oslo_messaging_rabbit]

[oslo_messaging_notifications]
driver = messagingv2
transport_url = *********

[notifications]
notification_format = unversioned
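If the cross-thread-heartbeat hypothesis above holds, the following [oslo_messaging_rabbit] settings would be worth reviewing. The values below are illustrative, not taken from our deployment's config:

[oslo_messaging_rabbit]
# heartbeat_in_pthread = true (the default in some releases of this
# era) runs the AMQP heartbeat in a native thread, which is known to
# interact badly with eventlet-based services like nova-compute.
heartbeat_in_pthread = false
# Pacing of reconnection attempts while a broker node is down.
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
kombu_reconnect_delay = 1.0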
** Affects: nova
     Importance: Undecided
         Status: New

** Tags: sts

https://bugs.launchpad.net/bugs/2054502

Title:
  shutting down rabbitmq causes nova-compute.service to go down

Status in OpenStack Compute (nova):
  New