Public bug reported:

neutron-ovn-metadata-agent uses network namespaces to separate the
metadata services for individual networks. For each network it
automatically creates or destroys an appropriate namespace.

If the metadata agent dies for reasons outside of its control (e.g. a
SIGKILL) while it is destroying a namespace, a broken namespace can be
left behind.

---
Background on pyroute2 namespace management:

Creating a network namespace works by:
1. Forking the process and doing everything in the new child
2. Ensuring /var/run/netns exists
3. Ensuring the file for the network namespace under /var/run/netns exists
   by creating a new empty file
4. Calling `unshare` with `CLONE_NEWNET` to move the process to a new
   network namespace
5. Creating a bind mount from `/proc/self/ns/net` to the file under
   /var/run/netns

Deleting a network namespace works the other way around (but shorter):
1. Unmounting the previously created bind mount
2. Deleting the file for the network namespace

---

If the neutron-ovn-metadata-agent is killed between steps 1 and 2 of
deleting the network namespace, the namespace file still exists but no
longer points to any namespace.
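
One way to spot such a leftover file is sketched below. This is only a heuristic and not something neutron or pyroute2 does: it assumes that a live namespace file is a bind mount of an nsfs inode and therefore reports a different device than its parent directory, while a leftover placeholder is a plain file on the same filesystem. The function name `is_stale_netns_file` is made up for illustration.

```python
import os
import stat


def is_stale_netns_file(path):
    """Heuristic: True if `path` looks like an unmounted netns placeholder."""
    st = os.stat(path)
    parent = os.stat(os.path.dirname(os.path.abspath(path)))
    # Assumption: a live namespace file is a bind mount, so its st_dev
    # differs from the directory holding it. A leftover placeholder is a
    # regular file that lives on the same device as its parent directory.
    return stat.S_ISREG(st.st_mode) and st.st_dev == parent.st_dev
```

Called on `/var/run/netns/ovnmeta-<some-uuid>`, this would be expected to return True only for the broken state described above.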

When `garbage_collect_namespace` checks whether the namespace is empty, it
tries to enter the network namespace to list all devices in it. This raises
an exception because the namespace can no longer be entered.
neutron-ovn-metadata-agent then crashes, retries on its next start, and
crashes again.
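
A possible way to make the emptiness check tolerant of this state is sketched below. It is not the actual neutron fix. `namespace_state` and the `list_devices` callable are hypothetical stand-ins for the privileged device lookup, and the sketch assumes that a broken namespace always surfaces as `OSError` with errno 22 (`EINVAL`), as in the traceback that follows.

```python
import errno


def namespace_state(name, list_devices):
    """Classify a namespace as 'broken', 'empty' or 'in_use'.

    `list_devices` is a hypothetical callable that returns the device
    names inside the namespace `name`.
    """
    try:
        devices = list_devices(name)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise  # unrelated failure, do not hide it
        # Assumption: EINVAL means the file no longer refers to a namespace.
        return "broken"
    return "in_use" if devices else "empty"
```

A caller could then delete the leftover file for the "broken" result instead of letting the exception end the agent.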


```
Traceback (most recent call last):
  File "/usr/local/bin/neutron-ovn-metadata-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
    metadata_agent.main()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata_agent.py", line 41, in main
    agt.start()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 277, in start
    self.sync()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 61, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 349, in sync
    self.teardown_datapath(self._get_datapath_name(ns))
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 400, in teardown_datapath
    ip.garbage_collect_namespace()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 268, in garbage_collect_namespace
    if self.namespace_is_empty():
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 263, in namespace_is_empty
    return not self.get_devices()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 180, in get_devices
    devices = privileged.get_device_names(self.namespace)
  File "/usr/local/lib/python3.9/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 609, in get_device_names
    in get_link_devices(namespace, **kwargs)]
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
    return fut.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
    return self.channel.remote_call(name, args, kwargs,
  File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
    raise exc_type(*result[2])
OSError: [Errno 22] failed to open netns
```


Versions: as far as I know, all versions are affected

Reproduction: the easiest way is to create an empty file named
`/var/run/netns/ovnmeta-<some-uuid>` and restart the
neutron-ovn-metadata-agent. Alternatively, use a breakpoint or a
well-timed kill command.
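
The first reproduction path could be scripted roughly as follows. This is a sketch only: `plant_broken_netns` is a made-up helper, the UUID is random rather than tied to a real network, and writing to /var/run/netns normally requires root.

```python
import os
import uuid


def plant_broken_netns(netns_dir="/var/run/netns"):
    """Create an empty ovnmeta-<uuid> file with no namespace behind it."""
    name = "ovnmeta-" + str(uuid.uuid4())
    os.makedirs(netns_dir, exist_ok=True)
    path = os.path.join(netns_dir, name)
    # An empty regular file without a bind mount mimics the state left
    # behind when deletion stops after the unmount step.
    with open(path, "x"):
        pass
    return path
```

After running it as root, restarting the neutron-ovn-metadata-agent should hit the failure shown in the traceback.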

** Affects: neutron
     Importance: Undecided
         Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2037102

Title:
  neutron-ovn-metadata-agent dies on broken namespace

Status in neutron:
  In Progress
