Public bug reported: We have hit an issue whereby nodes running Neutron Ussuri with ML2 L3HA are occasionally running into the following:
On a 3-node neutron-gateway with approximately 280 HA routers, the l3-agent sometimes gets into a state where it repeatedly tries to spawn a metadata proxy (haproxy) for a router that no longer exists, and fails because the namespace is no longer there. This happens thousands of times a day and effectively blocks the l3-agent from processing other updates. The error looks like:

2022-05-21 06:26:12.882 30127 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/57837a95-ed3b-4a1b-9393-1374a8c744c3.conf'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
2022-05-21 06:26:13.116 30127 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3": No such file or directory

Some background: when the l3-agent starts it subscribes callback methods to certain events. One of those events is "before_delete" [1] and the subscribed method is neutron.agent.metadata.driver.before_router_removed. The idea is that when an update to delete a router is received, this method runs first and removes the "neutron metadata-proxy monitor", which watches the haproxy pid and respawns the process if it dies. A successful callback execution looks like:

2022-05-21 03:05:54.676 30127 DEBUG neutron_lib.callbacks.manager [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-21 03:05:54.676 30127 DEBUG neutron.agent.linux.utils [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-15', '26363'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88

And an unsuccessful one looks like:

2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 57837a95-ed3b-4a1b-9393-1374a8c744c3 transitioned to master on agent sgdemr0114bp007
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_lib.callbacks.manager [req-6bfaa057-0ab9-450c-b27f-d4008fd7f9f1 - a87539ab4d2e4423b28ae6634e0d9c25 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 57837a95-ed3b-4a1b-9393-1374a8c744c3 _update_metadata_proxy /usr/lib/python3/dist-packages/neutron/agent/l3/ha.py:219

The difference is that instead of killing the proxy monitor it actually spawns a new one! The other thing to notice is that in the second case the spawn is not servicing the same request (it logs "[-]"). Looking at the code, this is because when the router transitions to master the l3-agent puts its thread to eventlet.sleep(2) [2] and then proceeds with the update, i.e. spawning the metadata proxy; while it was asleep the agent started to process the before_delete callback, which then got pre-empted. So this looks like a simple race condition, and it occurs if a router transitions to master at the same time as it is being deleted. The two sketches below illustrate the callback wiring and the interleaving.
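For context, the callback wiring described above looks roughly like the following (a paraphrased sketch, not the exact Ussuri source; the handler signature and registration code may differ slightly between releases):

from neutron_lib.callbacks import events, registry, resources


def before_router_removed(resource, event, l3_agent, payload=None):
    # In Neutron this stops the metadata-proxy monitor for the router that
    # is about to be deleted, so haproxy is not respawned afterwards.
    pass


# Done once at l3-agent startup: every router "before_delete" notification
# will now invoke before_router_removed on this agent.
registry.subscribe(before_router_removed, resources.ROUTER,
                   events.BEFORE_DELETE)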
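To make the interleaving concrete, here is a minimal, self-contained reproduction of the ordering problem using plain eventlet greenthreads (illustrative only, not Neutron code; the function names and the proxy_running flag are invented for the example):

import eventlet

proxy_running = False


def on_master_transition():
    # Mirrors the l3-agent behaviour: after the keepalived state change it
    # sleeps for 2 seconds, yielding to other greenthreads, and only then
    # spawns the metadata proxy for the router.
    global proxy_running
    eventlet.sleep(2)
    proxy_running = True
    print("spawned metadata proxy")


def on_before_delete():
    # Mirrors before_router_removed: stop the metadata-proxy monitor so the
    # haproxy process is not respawned for the deleted router.
    global proxy_running
    proxy_running = False
    print("killed metadata proxy monitor")


t1 = eventlet.spawn(on_master_transition)
eventlet.sleep(0.1)  # the router delete arrives while t1 is still sleeping
t2 = eventlet.spawn(on_before_delete)
t1.wait()
t2.wait()
# Ends with proxy_running == True: the proxy outlives the router delete.
print("proxy_running =", proxy_running)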
A simple interim workaround is to manually create the missing namespace, which allows the respawn to succeed; the before_delete callback should then eventually get to run and delete it again to clean up. Alternatively, restart the neutron-l3-agent service. A sketch of this workaround is included at the end of this report.

[1] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/metadata/driver.py#L186
[2] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149

** Affects: neutron
     Importance: Undecided
         Status: New

Title: l3ha router delete race condition if state changes to master
https://bugs.launchpad.net/bugs/1979089
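As an illustration of the interim workaround above, the following is a minimal sketch (the helper name and hard-coded router UUID are just for the example; it is simply the equivalent of running "ip netns add qrouter-<router-id>" as root on the affected gateway node):

import subprocess

# Router UUID taken from the logs above; substitute the affected router's ID.
ROUTER_ID = "57837a95-ed3b-4a1b-9393-1374a8c744c3"


def recreate_router_namespace(router_id):
    # Hypothetical helper: recreate the missing qrouter namespace so the
    # metadata-proxy respawn can succeed until the before_delete callback
    # (or an l3-agent restart) cleans it up again. Must run as root.
    namespace = "qrouter-%s" % router_id
    subprocess.run(["ip", "netns", "add", namespace], check=True)
    return namespace


if __name__ == "__main__":
    print("created namespace", recreate_router_namespace(ROUTER_ID))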

