Public bug reported:

Note: I'm creating this bug report both in nova and in neutron because I'm not sure which project the bug belongs to.
Description
===========
We run automated Python scripts on our monitoring system that create, test, and delete OpenStack resources. Each of our three Availability Zones (AZs) runs two of these processes, and the monitoring cycles repeat every 5 to 10 minutes.

Approximately once every 1 to 2 days, one of these monitoring checks fails with the following error:
https://paste.openstack.org/show/bt3BWnyrBbQspiFx8KPZ/
"""
2025-03-12 18:42:17,764 ERROR: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'neutronclient.common.exceptions.InternalServerError'> (HTTP 500) (Request-ID: req-3b755d12-27b1-4531-b948-b613520c05ae)
"""

Investigating this Request-ID in the nova-api logs, we found that nova reports an InternalServerError coming from neutron, along with a neutron Request-ID:
https://paste.openstack.org/show/b6wGVVZNwaAuLmFgHvaq/
"""
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi neutronclient.common.exceptions.InternalServerError: The server has either erred or is incapable of performing the requested operation.<br /><br />
"""

Further investigation of that Request-ID in the neutron logs revealed this specific error:
https://paste.openstack.org/show/bJL9b5aY99XpZlLZdzgy/
"""
2025-03-12 18:42:17.754 24 ERROR oslo_middleware.catch_errors oslo_db.exception.RetryRequest: Network 6163d404-aa33-4d83-a8a4-ead7dc0d9e14 could not be found.
"""

Additionally, looking at the HTTP DELETE request, we confirmed that the network was indeed deleted just milliseconds (0.005 s) before the nova error occurred:
"""
log_level:INFO, logger:openstack.neutron, programname:neutron-server, python_module:neutron.wsgi, timestamp:2025-03-12 18:42:17.749
payload: XX.XXX.XX.XXX,10.170.65.35 "DELETE /v2.0/networks/6163d404-aa33-4d83-a8a4-ead7dc0d9e14 HTTP/1.0" status: 204 len: 168 time: 0.6581874
"""

In this specific example the timing was very tight: the network deletion happened just 0.005 seconds before the nova error. In other occurrences the gap between the DELETE and the nova error was slightly larger (typically 0.5 to 1.2 seconds), but the same errors were triggered.

What we suspect
===============
What we suspect is happening is that, during instance creation, nova requests all ports from neutron using neutronclient:
"""
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/network/neutron.py", line 2664, in validate_networks
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi     ports = neutron.list_ports(**params)['ports']
"""

We believe that either neutron or neutronclient then tries to fetch the network details for each port it has gathered. Since it has collected all the ports in the project, it attempts to look up the network of every one of those ports.

The issue happens when another process (in our case, one of the other five Python scripts) deletes a network right after neutron has gathered the ports but just before it fetches the network details. Neutron then tries to look up the now-deleted network and fails, because the network was removed a split second earlier. A rough sketch of the suspected sequence is shown below.
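To make the suspected window more concrete, here is a minimal sketch that mimics the same pattern with openstacksdk: first gather all ports of the project, then resolve each port's network. This is our own illustration, not the actual nova/neutron code, and the cloud name and project ID are placeholders.
"""
# Minimal sketch of the suspected race, written with openstacksdk purely for
# illustration; it is NOT the nova/neutron implementation, it only mimics the
# "gather all ports, then resolve each port's network" pattern we suspect.
import openstack
from openstack import exceptions

conn = openstack.connect(cloud='monitoring')   # placeholder clouds.yaml entry
project_id = 'PROJECT_ID'                      # placeholder project ID

# Step 1: gather all ports of the project (analogous to neutron.list_ports()).
ports = list(conn.network.ports(project_id=project_id))

# Step 2: resolve the network of every gathered port. If another process
# deletes one of these networks between step 1 and step 2, the lookup fails.
for port in ports:
    try:
        conn.network.get_network(port.network_id)
    except exceptions.NotFoundException:
        # This is the window we suspect: the port was gathered before its
        # network was deleted, so the follow-up lookup no longer finds it.
        print(f"network {port.network_id} vanished between the two calls")
"""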
Temporary workaround
====================
We figured out a workaround for this problem, based on this check in nova (/nova/nova/network/neutron.py:2715-2717):
"""
if quotas.get('port', -1) == -1:
    # Unlimited Port Quota
    return num_instances
"""

Setting the port quota of the project in question to unlimited (-1) makes sure that validate_networks() just returns num_instances and skips the part where neutron.list_ports() is called (line 2721). This has fixed the problem for us for now. A sketch of how the quota change can be applied is shown below.
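For completeness, this is roughly how the workaround can be applied. The cloud name and project ID are placeholders, and we assume the openstacksdk network proxy's update_quota() call; the equivalent CLI would be "openstack quota set --ports -1 <project-id>".
"""
# Rough sketch of applying the workaround: set the project's port quota to
# unlimited (-1) so that validate_networks() returns early and never calls
# neutron.list_ports(). Cloud name and project ID are placeholders.
import openstack

conn = openstack.connect(cloud='monitoring')   # placeholder clouds.yaml entry
project_id = 'PROJECT_ID'                      # placeholder project ID

# -1 means "unlimited" for neutron quotas; the quota ID is the project ID.
conn.network.update_quota(project_id, ports=-1)
"""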
""" Additionally, looking at the HTTP DELETE request, we confirmed that the network was indeed deleted just milliseconds (0.005s) before the nova error occurred: """ log_level:INFO, logger:openstack.neutron, programname:neutron-server, python_module:neutron.wsgi, timestamp:2025-03-12 18:42:17.749 payload: XX.XXX.XX.XXX,10.170.65.35 \"DELETE /v2.0/networks/6163d404-aa33-4d83-a8a4-ead7dc0d9e14 HTTP/1.0\" status: 204 len: 168 time: 0.6581874 """ This specific example was pretty close in timing — the network deletion happened just 0.005 seconds before the nova error. In other occurrences, the gap between the DELETE and the nova error was slightly larger (typically 0.5 to 1.2 seconds), but the same errors where triggered. What we suspect =============== What we suspect is happening is that at the creation part of a instance, nova requests all ports from neutron using the neutronclient. """ 2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/network/neutron.py", line 2664, in validate_networks 2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi ports = neutron.list_ports(**params)['ports'] """ We believe that either Neutron or neutronclient tries to fetch the network details for each port it has already gathered in a list. Since it has collected all the ports in the project, it then attempts to fetch all the information of each ports network in the project. The issue happens if another process (in our case, one of the other five Python scripts) deletes a network right after Neutron has gathered the ports but just before it tries to fetch the network details. When that happens, Neutron tries to look up the now-deleted network, triggering the error because the network was removed a split second earlier. Temporary workaround ==================== We'd figured out a workaround for this problem: /nova/nova/network/neutron.py:2715-2717 """ if quotas.get('port', -1) == -1: # Unlimited Port Quota return num_instances """ Setting the port quota of the project in question to unlimited (-1). Makes sure that validate_networks() just returns num_instances and skips the part where neutron.list_ports() is called (line 2721). This has for now fixed the problem for us. Steps to reproduce ================== So far, all occurrences have happened randomly during automated processes. The steps below are based on my understanding and speculation of what’s happening: 1. Create two private networks in the same project — Network A and Network B. 2. Ensure that both networks contain one or more ports. 3. Request to create an instance attached to Network A using openstacksdk. - During this process, nova-api calls neutron.list_ports(). - Neutron gathers all ports in the project. 4. Delete Network B — make sure this happens right AFTER Neutron has finished gathering all ports, but right BEFORE Neutron starts fetching network details for the ports it collected. - Neutron attempts to fetch the network information for all gathered ports, which includes Network B (now deleted). - Since Network B no longer exists, Neutron fails to find it, causing an error. X. As a result, the nova-api call and the openstacksdk operation fail. Expected result =============== The expected result is that either nova, neutron or neutronclient recognizes that the network is deleted. And either recollects all ports again or ignores the ports from the deleted network. So that nova can continue to run the instance creation. Actual result ============= Already described in description. 
Expected result
===============
The expected result is that nova, neutron, or neutronclient recognizes that the network has been deleted and either re-collects the ports or ignores the ports belonging to the deleted network, so that nova can continue with the instance creation.

Actual result
=============
Already described in the description above.

Environment
===========
OpenStack 2023.1 deployed w/ kolla-ansible
Ubuntu 24.04
Python scripts: Python 3.8.10 w/ openstacksdk 4.0.1
Neutron networking

** Affects: neutron
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/2103501

Title:
  Race condition - Error creating instance if network is deleted by different process

Status in neutron:
  New