Public bug reported:

Note: I'm filing this bug report in both nova and neutron because I'm not
sure in which project the bug lies.

Description
===========
We run automated Python scripts on our monitoring system that create, test, and 
delete OpenStack resources. Each of our three Availability Zones (AZs) has two 
of these processes. These monitoring cycles repeat every 5 to 10 minutes.
Approximately once every 1 to 2 days, one of these monitoring checks fails with 
the following error:

https://paste.openstack.org/show/bt3BWnyrBbQspiFx8KPZ/
        """
        2025-03-12 18:42:17,764 ERROR: Unexpected API Error. Please report 
this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
        <class 'neutronclient.common.exceptions.InternalServerError'> (HTTP 
500) (Request-ID: req-3b755d12-27b1-4531-b948-b613520c05ae)
        """

Upon investigating this Request-ID in the nova-api logs, we found that nova 
reports an InternalServerError coming from neutron, along with a neutron 
Request-ID:
https://paste.openstack.org/show/b6wGVVZNwaAuLmFgHvaq/
        """
        2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi 
neutronclient.common.exceptions.InternalServerError: The server has either 
erred or is incapable of performing the requested operation.<br /><br />
        """

Further investigation of that Request-ID in neutron logs revealed this specific 
error:
https://paste.openstack.org/show/bJL9b5aY99XpZlLZdzgy/
        """
        2025-03-12 18:42:17.754 24 ERROR oslo_middleware.catch_errors 
oslo_db.exception.RetryRequest: Network 6163d404-aa33-4d83-a8a4-ead7dc0d9e14 
could not be found.
        """

Additionally, looking at the HTTP DELETE request, we confirmed that the network 
was indeed deleted just milliseconds (0.005s) before the nova error occurred:
        """
        log_level:INFO,
        logger:openstack.neutron,
        programname:neutron-server,
        python_module:neutron.wsgi,
        timestamp:2025-03-12 18:42:17.749
        payload:
        XX.XXX.XX.XXX,10.170.65.35 \"DELETE 
/v2.0/networks/6163d404-aa33-4d83-a8a4-ead7dc0d9e14 HTTP/1.0\" status: 204  
len: 168 time: 0.6581874
        """
This specific example was pretty close in timing — the network deletion 
happened just 0.005 seconds before the nova error.
In other occurrences, the gap between the DELETE and the nova error was 
slightly larger (typically 0.5 to 1.2 seconds), but the same errors were 
triggered.


What we suspect
===============
What we suspect is happening is that, during instance creation, nova requests 
all ports of the project from neutron using neutronclient:
        """
        2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi   File 
"/var/lib/kolla/venv/lib/python3.10/site-packages/nova/network/neutron.py", 
line 2664, in validate_networks
        2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi     ports = 
neutron.list_ports(**params)['ports']
        """
We believe that either Neutron or neutronclient then tries to fetch the 
network details for each port it has gathered. Since it has collected all the 
ports in the project, it attempts to fetch the information for each port's 
network. The issue arises if another process (in our case, one of the other 
five Python scripts) deletes a network right after Neutron has gathered the 
ports but just before it fetches the network details. When that happens, 
Neutron looks up the now-deleted network and fails, because the network was 
removed a split second earlier.
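
As a rough illustration (not taken from our scripts), the call that nova makes
can be reproduced directly with python-neutronclient. The cloud name below is a
placeholder and the session setup is just one possible way to authenticate:
        """
        import openstack
        from neutronclient.v2_0 import client as neutron_client

        # Placeholder clouds.yaml entry; any authenticated session works here.
        conn = openstack.connect(cloud='monitoring')
        neutron = neutron_client.Client(session=conn.session)

        # Roughly the same call nova's validate_networks() makes (see the
        # traceback above). If another process deletes a network in the project
        # between neutron gathering the ports and resolving their networks,
        # this raises neutronclient's InternalServerError (HTTP 500).
        ports = neutron.list_ports(tenant_id=conn.current_project_id)['ports']
        """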


Temporary workaround
====================
We figured out a workaround for this problem:
/nova/nova/network/neutron.py:2715-2717
        """
                    if quotas.get('port', -1) == -1:
                        # Unlimited Port Quota
                        return num_instances
        """
Setting the port quota of the project in question to unlimited (-1) makes sure 
that validate_networks() just returns num_instances and skips the part where 
neutron.list_ports() is called (line 2721). This has fixed the problem for us 
for now.
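
For completeness, one way to set that quota with openstacksdk (cloud and
project names are placeholders; the CLI equivalent is
"openstack quota set --ports -1 <project>"):
        """
        import openstack

        conn = openstack.connect(cloud='monitoring')         # placeholder cloud name
        project = conn.identity.find_project('monitoring')   # placeholder project name

        # ports=-1 means "unlimited"; with that, validate_networks() returns
        # num_instances early and never reaches neutron.list_ports().
        conn.network.update_quota(project.id, ports=-1)
        """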


Steps to reproduce
==================
So far, all occurrences have happened randomly during automated processes. The 
steps below are based on my understanding and speculation of what’s happening:
1. Create two private networks in the same project — Network A and Network B.
2. Ensure that both networks contain one or more ports.
3. Request to create an instance attached to Network A using openstacksdk. 
        - During this process, nova-api calls neutron.list_ports().
        - Neutron gathers all ports in the project.
4. Delete Network B — make sure this happens right AFTER Neutron has finished 
gathering all ports, but right BEFORE Neutron starts fetching network details 
for the ports it collected.
        - Neutron attempts to fetch the network information for all gathered 
ports, which includes Network B (now deleted).
        - Since Network B no longer exists, Neutron fails to find it, causing 
an error.
5. As a result, the nova-api call and the openstacksdk operation fail. A rough 
reproduction sketch using openstacksdk is shown below.
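
The following is a rough, speculative reproduction sketch with openstacksdk.
The cloud name, image ID and flavor ID are placeholders, and because the window
is only a fraction of a second the loops may have to run for a long time before
they hit the race:
        """
        import threading
        import time

        import openstack

        conn = openstack.connect(cloud='monitoring')    # placeholder cloud name
        IMAGE_ID = '<image-uuid>'                       # placeholder
        FLAVOR_ID = '<flavor-uuid>'                     # placeholder

        net_a = conn.network.create_network(name='race-net-a')
        conn.network.create_subnet(network_id=net_a.id, ip_version=4,
                                   cidr='192.0.2.0/24')

        def churn_network_b():
            # Keep creating and deleting Network B (with a port on it) from a
            # second connection so that, sooner or later, a delete lands between
            # neutron gathering the project's ports and resolving their networks.
            conn_b = openstack.connect(cloud='monitoring')
            while True:
                net_b = conn_b.network.create_network(name='race-net-b')
                conn_b.network.create_subnet(network_id=net_b.id, ip_version=4,
                                             cidr='198.51.100.0/24')
                port = conn_b.network.create_port(network_id=net_b.id)
                conn_b.network.delete_port(port.id)
                conn_b.network.delete_network(net_b.id)

        threading.Thread(target=churn_network_b, daemon=True).start()

        while True:
            # Boot and delete a small instance on Network A; occasionally the
            # create call should fail with the HTTP 500 described above.
            server = conn.compute.create_server(
                name='race-test', image_id=IMAGE_ID, flavor_id=FLAVOR_ID,
                networks=[{'uuid': net_a.id}])
            conn.compute.wait_for_server(server)
            conn.compute.delete_server(server.id)
            conn.compute.wait_for_delete(server)
            time.sleep(1)
        """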


Expected result
===============
The expected result is that nova, neutron, or neutronclient recognizes that 
the network has been deleted and either re-collects the ports or ignores the 
ports from the deleted network, so that nova can continue with the instance 
creation.
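
For illustration only (not a proposed patch), the kind of handling we have in
mind could look roughly like the following hypothetical retry helper around the
existing neutronclient call:
        """
        from neutronclient.common import exceptions as neutron_exc

        def list_ports_with_retry(neutron, retries=3, **params):
            # Hypothetical helper: re-run the port listing if neutron fails
            # mid-listing because a network was deleted concurrently.
            for attempt in range(retries):
                try:
                    return neutron.list_ports(**params)['ports']
                except neutron_exc.InternalServerError:
                    if attempt == retries - 1:
                        raise
        """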


Actual result
=============
As described above: the nova-api call fails with an InternalServerError 
(HTTP 500) from neutron, and the instance is not created.


Environment
===========
OpenStack 2023.1 deployed w/ ansible-kolla
Ubuntu 24.04
Python scripts: python3.8.10 w/ openstacksdk 4.0.1
Neutron networking

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2103501

Title:
  Race condition - Error creating instance if network is deleted by
  different process

Status in neutron:
  New
