Public bug reported:

`ip_monitor` form `neutron/agent/linux/ip_lib.py` has high possibility
to deadlock during shutdown.

Code assumes EOFError exception will be thrown, which introduces race
condition.

During code analyze i found that both implementations
pyroute2.iproute.IPRoute and pyroute2.nslink.nslink.NetNS used by
ip_monitor during call of close sends `NetlinkError(104, 'Connection
reset by peer')` back to receiving thread. I believe that for proper
handling of event_stop, read_ip_updates thread should handle this error
instead of relaying on EOFError exception.


For tests i created test network namespace, and flapped link state of loopback 
interface (but error can occur without any changes inside namespace:
```
sudo ip netns create test
sudo ip netns exec test bash
  while [ 1 ] ; do ip l s down lo ; sleep 1 ; ip l s up lo ; sleep 1; done
```

Fail result can be reproduced using neutron test:
```
sudo timeout 3 python neutron/tests/functional/agent/linux/bin/ip_monitor.py 
temp_file test
```

As for now i know that this affects neutron_l3_agent in ovs deployments
since 23.2.0 (we discovered it happens till commit
https://github.com/openstack/neutron/commit/a7aeec703de2b1db2849da206fa349037ce23a0e).
I assume changing of pidfile management broke some cleanup process that
handles deadlocked daemons.


I verified in our lab that neutron_l3_agent processes leak can by fixed by 
simple fix like:
```
diff --git a/neutron/agent/linux/ip_lib.py b/neutron/agent/linux/ip_lib.py
index c9b42d83dd..1c5274b5b5 100644
--- a/neutron/agent/linux/ip_lib.py
+++ b/neutron/agent/linux/ip_lib.py
@@ -1534,6 +1534,8 @@ def ip_monitor(namespace, queue, event_stop, 
event_started):
             while True:
                 ip_addresses = _ip.get()
                 for ip_address in ip_addresses:
+                    if ip_address.get('error') == errno.ECONNRESET:
+                        return
                     LOG.debug("IP monitor %s; Adding IP address: %s "
                               "to the queue.", namespace, ip_address)
                     _queue.put(ip_address)
```

As of bug inside ip_monitor i reproduced it on all neutron versions
between 16.0.0 and 25.1.0 using pyroute2==0.8.1

I did not run tests on other platfroms than linux (fedora 41 and ubuntu
22.04.5 LTS (Jammy Jellyfish))

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2104979

Title:
  Possible deadlock on closing ip_monitor thread.

Status in neutron:
  New

Bug description:
  `ip_monitor` form `neutron/agent/linux/ip_lib.py` has high possibility
  to deadlock during shutdown.

  Code assumes EOFError exception will be thrown, which introduces race
  condition.

  During code analyze i found that both implementations
  pyroute2.iproute.IPRoute and pyroute2.nslink.nslink.NetNS used by
  ip_monitor during call of close sends `NetlinkError(104, 'Connection
  reset by peer')` back to receiving thread. I believe that for proper
  handling of event_stop, read_ip_updates thread should handle this
  error instead of relaying on EOFError exception.

  
  For tests i created test network namespace, and flapped link state of 
loopback interface (but error can occur without any changes inside namespace:
  ```
  sudo ip netns create test
  sudo ip netns exec test bash
    while [ 1 ] ; do ip l s down lo ; sleep 1 ; ip l s up lo ; sleep 1; done
  ```

  Fail result can be reproduced using neutron test:
  ```
  sudo timeout 3 python neutron/tests/functional/agent/linux/bin/ip_monitor.py 
temp_file test
  ```

  As for now i know that this affects neutron_l3_agent in ovs
  deployments since 23.2.0 (we discovered it happens till commit
  
https://github.com/openstack/neutron/commit/a7aeec703de2b1db2849da206fa349037ce23a0e).
  I assume changing of pidfile management broke some cleanup process
  that handles deadlocked daemons.

  
  I verified in our lab that neutron_l3_agent processes leak can by fixed by 
simple fix like:
  ```
  diff --git a/neutron/agent/linux/ip_lib.py b/neutron/agent/linux/ip_lib.py
  index c9b42d83dd..1c5274b5b5 100644
  --- a/neutron/agent/linux/ip_lib.py
  +++ b/neutron/agent/linux/ip_lib.py
  @@ -1534,6 +1534,8 @@ def ip_monitor(namespace, queue, event_stop, 
event_started):
               while True:
                   ip_addresses = _ip.get()
                   for ip_address in ip_addresses:
  +                    if ip_address.get('error') == errno.ECONNRESET:
  +                        return
                       LOG.debug("IP monitor %s; Adding IP address: %s "
                                 "to the queue.", namespace, ip_address)
                       _queue.put(ip_address)
  ```

  As of bug inside ip_monitor i reproduced it on all neutron versions
  between 16.0.0 and 25.1.0 using pyroute2==0.8.1

  I did not run tests on other platfroms than linux (fedora 41 and
  ubuntu 22.04.5 LTS (Jammy Jellyfish))

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2104979/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

Reply via email to