Public bug reported: `ip_monitor` form `neutron/agent/linux/ip_lib.py` has high possibility to deadlock during shutdown.
Code assumes EOFError exception will be thrown, which introduces race condition. During code analyze i found that both implementations pyroute2.iproute.IPRoute and pyroute2.nslink.nslink.NetNS used by ip_monitor during call of close sends `NetlinkError(104, 'Connection reset by peer')` back to receiving thread. I believe that for proper handling of event_stop, read_ip_updates thread should handle this error instead of relaying on EOFError exception. For tests i created test network namespace, and flapped link state of loopback interface (but error can occur without any changes inside namespace: ``` sudo ip netns create test sudo ip netns exec test bash while [ 1 ] ; do ip l s down lo ; sleep 1 ; ip l s up lo ; sleep 1; done ``` Fail result can be reproduced using neutron test: ``` sudo timeout 3 python neutron/tests/functional/agent/linux/bin/ip_monitor.py temp_file test ``` As for now i know that this affects neutron_l3_agent in ovs deployments since 23.2.0 (we discovered it happens till commit https://github.com/openstack/neutron/commit/a7aeec703de2b1db2849da206fa349037ce23a0e). I assume changing of pidfile management broke some cleanup process that handles deadlocked daemons. I verified in our lab that neutron_l3_agent processes leak can by fixed by simple fix like: ``` diff --git a/neutron/agent/linux/ip_lib.py b/neutron/agent/linux/ip_lib.py index c9b42d83dd..1c5274b5b5 100644 --- a/neutron/agent/linux/ip_lib.py +++ b/neutron/agent/linux/ip_lib.py @@ -1534,6 +1534,8 @@ def ip_monitor(namespace, queue, event_stop, event_started): while True: ip_addresses = _ip.get() for ip_address in ip_addresses: + if ip_address.get('error') == errno.ECONNRESET: + return LOG.debug("IP monitor %s; Adding IP address: %s " "to the queue.", namespace, ip_address) _queue.put(ip_address) ``` As of bug inside ip_monitor i reproduced it on all neutron versions between 16.0.0 and 25.1.0 using pyroute2==0.8.1 I did not run tests on other platfroms than linux (fedora 41 and ubuntu 22.04.5 LTS (Jammy Jellyfish)) ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/2104979 Title: Possible deadlock on closing ip_monitor thread. Status in neutron: New Bug description: `ip_monitor` form `neutron/agent/linux/ip_lib.py` has high possibility to deadlock during shutdown. Code assumes EOFError exception will be thrown, which introduces race condition. During code analyze i found that both implementations pyroute2.iproute.IPRoute and pyroute2.nslink.nslink.NetNS used by ip_monitor during call of close sends `NetlinkError(104, 'Connection reset by peer')` back to receiving thread. I believe that for proper handling of event_stop, read_ip_updates thread should handle this error instead of relaying on EOFError exception. For tests i created test network namespace, and flapped link state of loopback interface (but error can occur without any changes inside namespace: ``` sudo ip netns create test sudo ip netns exec test bash while [ 1 ] ; do ip l s down lo ; sleep 1 ; ip l s up lo ; sleep 1; done ``` Fail result can be reproduced using neutron test: ``` sudo timeout 3 python neutron/tests/functional/agent/linux/bin/ip_monitor.py temp_file test ``` As for now i know that this affects neutron_l3_agent in ovs deployments since 23.2.0 (we discovered it happens till commit https://github.com/openstack/neutron/commit/a7aeec703de2b1db2849da206fa349037ce23a0e). I assume changing of pidfile management broke some cleanup process that handles deadlocked daemons. I verified in our lab that neutron_l3_agent processes leak can by fixed by simple fix like: ``` diff --git a/neutron/agent/linux/ip_lib.py b/neutron/agent/linux/ip_lib.py index c9b42d83dd..1c5274b5b5 100644 --- a/neutron/agent/linux/ip_lib.py +++ b/neutron/agent/linux/ip_lib.py @@ -1534,6 +1534,8 @@ def ip_monitor(namespace, queue, event_stop, event_started): while True: ip_addresses = _ip.get() for ip_address in ip_addresses: + if ip_address.get('error') == errno.ECONNRESET: + return LOG.debug("IP monitor %s; Adding IP address: %s " "to the queue.", namespace, ip_address) _queue.put(ip_address) ``` As of bug inside ip_monitor i reproduced it on all neutron versions between 16.0.0 and 25.1.0 using pyroute2==0.8.1 I did not run tests on other platfroms than linux (fedora 41 and ubuntu 22.04.5 LTS (Jammy Jellyfish)) To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/2104979/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp