[Yahoo-eng-team] [Bug 2048848] [NEW] get_power_state blocked
Public bug reported:

Description
-----------
When the network for an rbd (RADOS Block Device) storage backend disconnects due to a failure, `get_power_state` becomes blocked when attempting to query the power state of a virtual machine. The goal is to check the power state and migrate running VMs. However, when the periodic monitoring program `domstats` hangs while accessing the disconnected storage, it keeps libvirt's rpc-workers occupied for extended periods. With multiple virtual machines, calls to the power state query interface are also delayed and cannot be executed immediately.

Steps to reproduce
------------------
1. Disconnect the network for the rbd storage.
2. Schedule `domstats` to run every 10 seconds.

Expected result
---------------
Switch to a higher-priority interface within libvirt, such as `domain.state()`, possibly in conjunction with a priority RPC mechanism like `prio-rpc`. This would ensure that critical operations, including querying power states and performing necessary migrations, are prioritized and can still be executed promptly even under resource-constrained conditions.

** Affects: nova
   Importance: Undecided
   Assignee: Yalei Li (chetaiyong)
   Status: New

** Changed in: nova
   Assignee: (unassigned) => Yalei Li (chetaiyong)

https://bugs.launchpad.net/bugs/2048848
Title: get_power_state blocked
Status in OpenStack Compute (nova): New
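For context, a minimal sketch (not nova's actual code path) of querying a domain's power state via libvirt's `domain.state()`, the interface the report suggests. It assumes the libvirt-python bindings and a local qemu URI; whether it avoids blocking in practice still depends on libvirt's worker/prio-rpc handling.

```python
# Sketch only: map libvirt domain states to simple names using
# domain.state(), which reads in-memory state rather than collecting
# per-disk statistics like domstats does.
import libvirt

STATE_NAMES = {
    libvirt.VIR_DOMAIN_RUNNING: 'running',
    libvirt.VIR_DOMAIN_PAUSED: 'paused',
    libvirt.VIR_DOMAIN_SHUTOFF: 'shutoff',
    libvirt.VIR_DOMAIN_CRASHED: 'crashed',
}


def get_power_state(domain_name, uri='qemu:///system'):
    conn = libvirt.openReadOnly(uri)
    try:
        dom = conn.lookupByName(domain_name)
        state, _reason = dom.state()  # returns (state, reason)
        return STATE_NAMES.get(state, 'unknown')
    finally:
        conn.close()
```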
[Yahoo-eng-team] [Bug 2048874] [NEW] group_policy flavor extra spec is not compatible with AggregateInstanceExtraSpecsFilter
Public bug reported:

Adding this extra spec to use the 'granular resource request' feature of placement effectively also requires that all the computes such a flavor attempts to target be added to an aggregate with the metadata 'group_policy' set to 'none' or 'isolate'. We either have to finally move the group_policy extra spec to its own namespace (there is a TODO in the code for that, similar to what has been done to hide_hypervisor_id), or explicitly ignore this key in AggregateInstanceExtraSpecsFilter.

Example: using a flavor with the extra specs group_policy='none', resources1:CUSTOM_MIG_1G_5GB='1', resources2:CUSTOM_MIG_1G_5GB='1' and having no aggregates, I get the following error in the scheduler log:

2024-01-09 22:00:03.045 1 DEBUG nova.filters [None req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b d7355faecc8c45edb7d7de3837df6fd9 - - default default] Starting with 1 host(s) get_filtered_objects /var/lib/openstack/lib/python3.10/site-packages/nova/filters.py:70
2024-01-09 22:00:03.046 1 DEBUG nova.scheduler.filters.aggregate_instance_extra_specs [None req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b d7355faecc8c45edb7d7de3837df6fd9 - - default default] (kaas-node-f5f4a99c-6783-4b5f-b42a-0d772e1b0b11, kaas-node-f5f4a99c-6783-4b5f-b42a-0d772e1b0b11.kaas-kubernetes-4d64eb64810c48b3b5ac17da6a77eede) ram: 192851MB disk: 658432MB io_ops: 0 instances: 0, allocation_candidates: 6 fails flavor extra_specs requirements. Extra_spec group_policy is not in aggregate. host_passes /var/lib/openstack/lib/python3.10/site-packages/nova/scheduler/filters/aggregate_instance_extra_specs.py:63
2024-01-09 22:00:03.046 1 INFO nova.filters [None req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b d7355faecc8c45edb7d7de3837df6fd9 - - default default] Filter AggregateInstanceExtraSpecsFilter returned 0 hosts

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2048874
Title: group_policy flavor extra spec is not compatible with AggregateInstanceExtraSpecsFilter
Status in OpenStack Compute (nova): New
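To illustrate the "explicitly ignore this key" option, here is a hypothetical, simplified sketch (not the actual nova filter or a proposed patch) that mimics the shape of a host_passes() check while skipping keys consumed by placement's granular resource requests. The key list and prefix names are assumptions for illustration only.

```python
# Hypothetical sketch: skip placement-only flavor extra specs instead of
# requiring them to appear in the host aggregate metadata.

_PLACEMENT_ONLY_KEYS = ('group_policy',)          # assumed list
_PLACEMENT_ONLY_PREFIXES = ('resources', 'trait')  # e.g. resources1:, trait:


def _is_placement_only(key):
    scope = key.split(':', 1)[0]
    return (key in _PLACEMENT_ONLY_KEYS
            or scope.startswith(_PLACEMENT_ONLY_PREFIXES))


def host_passes(aggregate_metadata, flavor_extra_specs):
    """Return True if the aggregate metadata satisfies the flavor specs."""
    for key, required in flavor_extra_specs.items():
        if _is_placement_only(key):
            continue  # handled by the placement service, not aggregates
        values = aggregate_metadata.get(key)
        if values is None or required not in values:
            return False
    return True
```

The real filter uses nova's extra-specs operator matching rather than a plain membership test; the point here is only where the skip would go.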
[Yahoo-eng-team] [Bug 2037102] Re: neutron-ovn-metadata-agent dies on broken namespace
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/896251
Committed: https://opendev.org/openstack/neutron/commit/566fea3fed837b0130023303c770aade391d3d61
Submitter: "Zuul (22348)"
Branch:    master

commit 566fea3fed837b0130023303c770aade391d3d61
Author: Felix Huettner
Date:   Fri Sep 22 16:25:10 2023 +0200

    fix netns deletion of broken namespaces

    Normal network namespaces are bind-mounted to files under
    /var/run/netns. If a process deleting a network namespace gets killed
    during that operation, there is a chance that the bind mount to the
    netns has been removed but the file under /var/run/netns still exists.

    When the neutron-ovn-metadata-agent tries to clean up such network
    namespaces, it first tries to validate that the network namespace is
    empty. For the case described above this fails, as the network
    namespace no longer really exists but is just a stray file lying
    around.

    To fix this we treat network namespaces where we get an `OSError` with
    errno 22 (Invalid Argument) as empty. The calls to pyroute2 to delete
    the namespace will then clean up the file.

    Additionally we add a guard to teardown_datapath to continue even if
    this fails. Failing to remove a datapath is not critical and in the
    worst case leaves a process and a network namespace running; however,
    previously it would also have prevented the creation of new datapaths,
    which is critical for VM startup.

    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4

** Changed in: neutron
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/2037102
Title: neutron-ovn-metadata-agent dies on broken namespace
Status in neutron: Fix Released

Bug description:

neutron-ovn-metadata-agent uses network namespaces to separate the metadata services for individual networks. For each network it automatically creates or destroys an appropriate namespace. If the metadata agent dies for reasons outside of its control (e.g. a SIGKILL) during namespace destruction, a broken namespace can be left over.

---

Background on pyroute2 namespace management:

Creating a network namespace works by:
1. Forking the process and doing everything in the new child
2. Ensuring /var/run/netns exists
3. Ensuring the file for the network namespace under /var/run/netns exists by creating a new empty file
4. Calling `unshare` with `CLONE_NEWNET` to move the process to a new network namespace
5. Creating a bind mount from `/proc/self/ns/net` to the file under /var/run/netns

Deleting a network namespace works the other way around (but shorter):
1. Unmounting the previously created bind mount
2. Deleting the file for the network namespace

---

If the neutron-ovn-metadata-agent is killed between step 1 and 2 of deleting the network namespace, the namespace file will still be around but no longer point to any namespace. When `garbage_collect_namespace` tries to check whether the namespace is empty, it tries to enter the network namespace to dump all devices in there. This raises an exception, as the namespace can no longer be entered. neutron-ovn-metadata-agent then crashes, tries again next time, and crashes again.
```
Traceback (most recent call last):
  File "/usr/local/bin/neutron-ovn-metadata-agent", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
    metadata_agent.main()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata_agent.py", line 41, in main
    agt.start()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 277, in start
    self.sync()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 61, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 349, in sync
    self.teardown_datapath(self._get_datapath_name(ns))
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 400, in teardown_datapath
    ip.garbage_collect_namespace()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 268, in garbage_collect_namespace
    if self.namespace_is_empty():
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 263, in namespace_is_empty
    return not self.get_devices()
  File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 180, in get_devices
    devices = privileged.get_device_names(self.namespace)
  File "/usr/local/lib/python3.9/site-packages/neu
```
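A minimal sketch of the approach described in the commit message above (not the actual neutron patch): treat a namespace that raises `OSError` with errno 22 (EINVAL) as empty, so garbage collection proceeds and pyroute2 removes the stray file. The function names mirror the ones in the traceback but are standalone illustrations.

```python
import errno

from pyroute2 import NetNS
from pyroute2 import netns


def namespace_is_empty(name):
    try:
        ns = NetNS(name)
        try:
            # Ignore loopback; any other link means the namespace is in use.
            devices = [link for link in ns.get_links()
                       if link.get_attr('IFLA_IFNAME') != 'lo']
        finally:
            ns.close()
        return not devices
    except OSError as e:
        if e.errno == errno.EINVAL:
            # Stray file under /var/run/netns with no namespace behind it:
            # treat it as empty so it gets garbage collected.
            return True
        raise


def garbage_collect_namespace(name):
    if namespace_is_empty(name):
        netns.remove(name)  # also unlinks the file under /var/run/netns
```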
[Yahoo-eng-team] [Bug 2046939] Re: [OVN] ``OVNAgentExtensionManager`` is resetting the ``agent_api`` during the initialization
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/903943
Committed: https://opendev.org/openstack/neutron/commit/86efc8be9934713ad79b3415b8b5b72bd475e01c
Submitter: "Zuul (22348)"
Branch:    master

commit 86efc8be9934713ad79b3415b8b5b72bd475e01c
Author: Rodolfo Alonso Hernandez
Date:   Tue Dec 19 10:57:56 2023 +

    [OVN] OVN agent extensions correctly consume agent API

    Now the ``OVNAgentExtension`` class does not clear the agent API during
    the extension initialization. This patch also passes the agent object
    to the OVN agent extensions as the agent API. Any method required will
    be implemented directly on the OVN agent class.

    Closes-Bug: #2046939
    Change-Id: Ia635ca1ff97c3db43a34d3dec6a7f9df154dfe28

** Changed in: neutron
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/2046939
Title: [OVN] ``OVNAgentExtensionManager`` is resetting the ``agent_api`` during the initialization
Status in neutron: Fix Released

Bug description:

The ``OVNAgentExtensionManager`` instance of the OVN agent is resetting the ``agent_api`` member during the extension manager initialization. ``OVNAgentExtensionManager`` inherits from ``AgentExtensionsManager``. The ``initialize`` method iterates through the loaded extensions and executes the following methods:

* ``consume_api``: assigns the agent API to the extension.
* ``initialize``: due to a wrong implementation, this method was assigning None to the previously assigned agent API.
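To make the ordering concrete, here is a hypothetical, heavily simplified sketch of the extension-manager pattern described above (toy classes, not the actual neutron code): ``consume_api`` stores the agent API first, and ``initialize`` must not overwrite it.

```python
class OVNAgentExtension:
    """Toy stand-in for an OVN agent extension."""

    def __init__(self):
        self.agent_api = None

    def consume_api(self, agent_api):
        # Called first by the manager: keep a reference to the agent object.
        self.agent_api = agent_api

    def initialize(self, connection=None):
        # The reported bug amounted to doing `self.agent_api = None` here.
        # A correct implementation leaves the previously consumed API alone.
        assert self.agent_api is not None


class OVNAgentExtensionManager:
    """Toy manager mirroring the consume_api -> initialize sequence."""

    def __init__(self, extensions):
        self.extensions = extensions

    def initialize(self, agent_api, connection=None):
        for ext in self.extensions:
            ext.consume_api(agent_api)
            ext.initialize(connection)
```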
[Yahoo-eng-team] [Bug 2048979] [NEW] [ml2][ovs] ports without local vlan tag are processed on openflow security group
Public bug reported:

We recently hit an issue during VM live migration:

1. Nova starts the live migration.
2. Ports are plugged on the new host.
3. neutron-ovs-agent starts to process the port, but the port is in the 'added' and 'updated' sets at the same time.
4. Because nova has not yet activated the destination port binding, there is no local VLAN for this port.

The ovs-agent then hits the following error:

Error while processing VIF ports: OVSFWTagNotFound: Cannot get tag for port tap092f38ed-a7 from its other_config: {}

A fix should be added to ``setup_port_filters`` to remove the ports in "binding_no_activated_devices".

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2048979
Title: [ml2][ovs] ports without local vlan tag are processed on openflow security group
Status in neutron: New
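A hypothetical sketch of the fix suggested above (not the actual neutron change): filter out devices whose destination port binding is not yet activated before handing them to the firewall driver. The attribute ``binding_no_activated_devices`` and the surrounding method shape are assumptions for illustration.

```python
# Sketch only: skip ports without an activated binding (and hence without a
# local VLAN tag in other_config) when setting up security group filters.

def setup_port_filters(self, new_devices, updated_devices):
    # Assumed: the agent tracks devices whose binding nova has not yet
    # activated in self.binding_no_activated_devices.
    skip = self.binding_no_activated_devices
    new_devices = new_devices - skip
    updated_devices = updated_devices - skip

    if new_devices:
        self.prepare_devices_filter(new_devices)
    if updated_devices:
        self.refresh_firewall(updated_devices)
```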