[Yahoo-eng-team] [Bug 2048848] [NEW] get_power_state blocked

2024-01-10 Thread Yalei Li
Public bug reported:

Description
---
When the network to the RBD (RADOS Block Device) storage fails, `get_power_state` 
blocks while querying a virtual machine's power state. The goal is to check the 
power status and live-migrate the running VMs. However, when the periodic 
monitoring call `domstats` hangs on the disconnected storage, it keeps libvirt's 
rpc-workers occupied for extended periods. With many virtual machines on the 
host, calls to the power-state interface are queued behind the hung workers and 
cannot be executed promptly.

Steps to reproduce
--
1. Disconnect the network for the RBD storage.
2. Schedule `domstats` to run every 10 seconds.

Expected result
---
The expected outcome is for nova to switch to a higher-priority interface within 
libvirt, such as `domain.state()`, possibly in conjunction with a priority RPC 
mechanism like prio-rpc. Critical operations, including querying power states 
and performing the necessary migrations, would then be prioritized and still 
execute promptly even when the regular rpc-workers are blocked.
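
For illustration, a minimal sketch of that idea (not nova's actual libvirt
driver code; the connection URI, instance name, and state mapping are
assumptions) that queries the power state through `domain.state()`, which does
not touch the storage-dependent stats paths:

```
# Hedged sketch, not nova's actual driver code: query only the domain state,
# which does not require reading disk statistics from the (possibly hung) RBD
# backend. URI, instance name, and mapping below are illustrative assumptions.
import libvirt

POWER_STATE = {
    libvirt.VIR_DOMAIN_RUNNING: 'running',
    libvirt.VIR_DOMAIN_PAUSED: 'paused',
    libvirt.VIR_DOMAIN_SHUTOFF: 'shutdown',
    libvirt.VIR_DOMAIN_CRASHED: 'crashed',
}


def get_power_state(conn, instance_name):
    """Return a coarse power-state string for one domain."""
    dom = conn.lookupByName(instance_name)
    state, _reason = dom.state()   # virDomainGetState(); no block stats involved
    return POWER_STATE.get(state, 'nostate')


if __name__ == '__main__':
    conn = libvirt.open('qemu:///system')               # assumption: local connection
    print(get_power_state(conn, 'instance-00000001'))   # assumption: domain name
```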

** Affects: nova
 Importance: Undecided
 Assignee: Yalei Li (chetaiyong)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Yalei Li (chetaiyong)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2048848

Title:
  get_power_state blocked

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2048848/+subscriptions




[Yahoo-eng-team] [Bug 2048874] [NEW] group_policy flavor extra spec is not compatible with AggregateInstanceExtraSpecsFilter

2024-01-10 Thread Pavlo Shchelokovskyy
Public bug reported:

In effect, adding this extra spec to use placement's 'granular resource
request' feature also requires that every compute host such a flavor can
target is placed in an aggregate whose metadata sets 'group_policy' to
'none' or 'isolate'.

We either have to finally move the group_policy extra spec into its own namespace 
(there is a TODO in the code for that, similar to what was done for 
hide_hypervisor_id),
or explicitly ignore this key in AggregateInstanceExtraSpecsFilter.
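
A hedged sketch of the second option (simplified; not the actual nova filter
code, and the helper below is hypothetical) that skips placement-only keys
before comparing flavor extra specs against aggregate metadata:

```
# Hedged sketch of ignoring 'group_policy' in the aggregate extra-specs check;
# the function and the ignore list are illustrative, not nova's real code.
_IGNORED_KEYS = ('group_policy',)
_NAMESPACE = 'aggregate_instance_extra_specs'


def host_passes_extra_specs(extra_specs, aggregate_metadata):
    """Return True if every relevant extra spec matches the aggregate metadata."""
    for key, requirement in extra_specs.items():
        scope = key.split(':', 1)
        if len(scope) > 1 and scope[0] != _NAMESPACE:
            continue                  # other namespaces are skipped already
        key = scope[-1]
        if key in _IGNORED_KEYS:
            continue                  # proposed: placement-only keys never live in aggregates
        if requirement not in aggregate_metadata.get(key, set()):
            return False
    return True


# Example: with no aggregate metadata at all, the group_policy key no longer
# causes a rejection under this sketch.
print(host_passes_extra_specs(
    {'group_policy': 'none', 'resources1:CUSTOM_MIG_1G_5GB': '1'}, {}))
```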

Example: using a flavor with the following extra specs

group_policy='none', resources1:CUSTOM_MIG_1G_5GB='1',
resources2:CUSTOM_MIG_1G_5GB='1'

and with no aggregates defined, I get the following in the scheduler log:

2024-01-09 22:00:03.045 1 DEBUG nova.filters [None 
req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b 
d7355faecc8c45edb7d7de3837df6fd9 - - default default] Starting with 1 host(s) 
get_filtered_objects 
/var/lib/openstack/lib/python3.10/site-packages/nova/filters.py:70
2024-01-09 22:00:03.046 1 DEBUG 
nova.scheduler.filters.aggregate_instance_extra_specs [None 
req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b 
d7355faecc8c45edb7d7de3837df6fd9 - - default default] 
(kaas-node-f5f4a99c-6783-4b5f-b42a-0d772e1b0b11, 
kaas-node-f5f4a99c-6783-4b5f-b42a-0d772e1b0b11.kaas-kubernetes-4d64eb64810c48b3b5ac17da6a77eede)
 ram: 192851MB disk: 658432MB io_ops: 0 instances: 0, allocation_candidates: 6 
fails flavor extra_specs requirements. Extra_spec group_policy is not in 
aggregate. host_passes 
/var/lib/openstack/lib/python3.10/site-packages/nova/scheduler/filters/aggregate_instance_extra_specs.py:63
2024-01-09 22:00:03.046 1 INFO nova.filters [None 
req-db963b93-798c-4289-93f6-52ad7054ac70 8f2db13a6c5c462bbe921c41d2beac3b 
d7355faecc8c45edb7d7de3837df6fd9 - - default default] Filter 
AggregateInstanceExtraSpecsFilter returned 0 hosts

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2048874

Title:
  group_policy flavor extra spec is not compatible with
  AggregateInstanceExtraSpecsFilter

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2048874/+subscriptions




[Yahoo-eng-team] [Bug 2037102] Re: neutron-ovn-metadata-agent dies on broken namespace

2024-01-10 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/896251
Committed: 
https://opendev.org/openstack/neutron/commit/566fea3fed837b0130023303c770aade391d3d61
Submitter: "Zuul (22348)"
Branch: master

commit 566fea3fed837b0130023303c770aade391d3d61
Author: Felix Huettner 
Date:   Fri Sep 22 16:25:10 2023 +0200

fix netns deletion of broken namespaces

Normal network namespaces are bind-mounted to files under
/var/run/netns. If a process deleting a network namespace gets killed
during that operation, there is a chance that the bind mount to the
netns has been removed but the file under /var/run/netns still exists.

When the neutron-ovn-metadata-agent tries to clean up such network
namespaces, it first tries to validate that the network namespace is
empty. For the case described above this fails, as the network
namespace no longer really exists but is just a stray file lying
around.

To fix this, we treat network namespaces where we get an `OSError` with
errno 22 (Invalid Argument) as empty. The calls to pyroute2 to delete
the namespace will then clean up the file.

Additionally, we add a guard to teardown_datapath to continue even if
this fails. Failing to remove a datapath is not critical and, in the
worst case, leaves a process and a network namespace running; however,
previously it would also have prevented the creation of new datapaths,
which is critical for VM startup.

Closes-Bug: #2037102
Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2037102

Title:
  neutron-ovn-metadata-agent dies on broken namespace

Status in neutron:
  Fix Released

Bug description:
  neutron-ovn-metadata-agent uses network namespaces to separate the
  metadata services for individual networks. For each network it
  automatically creates or destroys an appropriate namespace.

  If the metadata agent dies for reasons outside of its control (e.g. a
  SIGKILL) during namespace destruction, a broken namespace can be left
  over.

  ---
  Background on pyroute2 namespace management:

  Creating a network namespace works by:
  1. Forking the process and doing everything in the new child
  2. Ensuring /var/run/netns exists
  3. Ensuring the file for the network namespace under /var/run/netns exists by 
creating a new empty file
  4. Calling `unshare` with `CLONE_NEWNET` to move the process to a new network namespace
  5. Creating a bind mount from `/proc/self/ns/net` to the file under 
/var/run/netns

  Deleting a network namespace works the other way around (but shorter; see the sketch after this list):
  1. Unmounting the previously created bind mount
  2. Deleting the file for the network namespace
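
  A minimal sketch of those two deletion steps (illustrative only; the real
  work happens inside pyroute2, and the libc call below is just to show the
  window in which a SIGKILL leaves a stray file behind):

```
# Illustrative sketch of the two netns deletion steps; not pyroute2's code.
import ctypes
import os

libc = ctypes.CDLL('libc.so.6', use_errno=True)


def delete_netns(name, netns_dir='/var/run/netns'):
    path = os.path.join(netns_dir, name)
    if libc.umount(path.encode()) != 0:      # step 1: drop the bind mount
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # A SIGKILL right here leaves /var/run/netns/<name> behind with no
    # namespace attached to it (the broken state this bug describes).
    os.unlink(path)                          # step 2: remove the file
```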

  ---

  If the neutron-ovn-metadata-agent is killed between steps 1 and 2 of
  deleting the network namespace, the namespace file will still be
  around but no longer point to any namespace.

  When `garbage_collect_namespace` tries to check whether the namespace is
  empty, it tries to enter the network namespace to dump all devices in it.
  This raises an exception because the namespace can no longer be entered.
  neutron-ovn-metadata-agent then crashes, tries again next time, and crashes
  again.

  
  ```
  Traceback (most recent call last):
    File "/usr/local/bin/neutron-ovn-metadata-agent", line 8, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
      metadata_agent.main()
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata_agent.py", line 41, in main
      agt.start()
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 277, in start
      self.sync()
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 61, in wrapped
      return f(*args, **kwargs)
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 349, in sync
      self.teardown_datapath(self._get_datapath_name(ns))
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 400, in teardown_datapath
      ip.garbage_collect_namespace()
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 268, in garbage_collect_namespace
      if self.namespace_is_empty():
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 263, in namespace_is_empty
      return not self.get_devices()
    File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 180, in get_devices
      devices = privileged.get_device_names(self.namespace)
    File "/usr/local/lib/python3.9/site-packages/neu
  ```

[Yahoo-eng-team] [Bug 2046939] Re: [OVN] ``OVNAgentExtensionManager`` is resetting the ``agent_api`` during the initialization

2024-01-10 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/903943
Committed: 
https://opendev.org/openstack/neutron/commit/86efc8be9934713ad79b3415b8b5b72bd475e01c
Submitter: "Zuul (22348)"
Branch: master

commit 86efc8be9934713ad79b3415b8b5b72bd475e01c
Author: Rodolfo Alonso Hernandez 
Date:   Tue Dec 19 10:57:56 2023 +

[OVN] OVN agent extensions correctly consume agent API

Now the ``OVNAgentExtension`` class does not clear the agent API during
the extension initialization.

This patch also passes the agent object to the OVN agent extensions as
the agent API. Any method required will be implemented directly on the
OVN agent class.

Closes-Bug: #2046939
Change-Id: Ia635ca1ff97c3db43a34d3dec6a7f9df154dfe28


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2046939

Title:
  [OVN] ``OVNAgentExtensionManager`` is resetting the ``agent_api``
  during the  initialization

Status in neutron:
  Fix Released

Bug description:
  The ``OVNAgentExtensionManager`` instance of the OVN agent resets the
  ``agent_api`` member during the extension manager initialization. The
  ``OVNAgentExtensionManager`` inherits from ``AgentExtensionsManager``. Its
  ``initialize`` method iterates through the loaded extensions and executes
  the following methods on each one (see the sketch after this list):
  * ``consume_api``: assigns the agent API to the extension.
  * ``initialize``: due to an incorrect implementation, this method then
  reassigns None to the agent API that was just set.
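
  A hedged sketch of that pattern (class and attribute names are simplified
  and do not match the neutron code exactly):

```
# Hedged sketch of the bug and the fix described above; simplified names.
class OVNAgentExtension:
    def __init__(self):
        self.agent_api = None

    def consume_api(self, agent_api):
        # Called first by the extension manager: store the agent API.
        self.agent_api = agent_api

    def initialize(self, connection, driver_type, agent_api=None):
        # Buggy pattern: unconditionally overwriting the attribute with the
        # defaulted parameter wipes out what consume_api() just stored:
        #     self.agent_api = agent_api
        # Fixed pattern: keep the previously consumed API.
        if agent_api is not None:
            self.agent_api = agent_api
```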

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2046939/+subscriptions




[Yahoo-eng-team] [Bug 2048979] [NEW] [ml2][ovs] ports without local vlan tag are processed on openflow security group

2024-01-10 Thread LIU Yulong
Public bug reported:

We recently hit an issue during VM live migration:
1. Nova starts the live migration.
2. Ports are plugged on the destination host.
3. neutron-ovs-agent starts to process the port, but the port is in the 'added'
and 'updated' sets at the same time.
4. Because nova has not yet activated the destination port binding, there is no
local VLAN for this port.

The ovs-agent then raises errors:
Error while processing VIF ports: OVSFWTagNotFound: Cannot get tag for port 
tap092f38ed-a7 from its other_config: {}


A fix should be added to ``setup_port_filters`` to remove the ports in the 
"binding_no_activated_devices" set.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2048979

Title:
  [ml2][ovs] ports without local vlan tag are processed on openflow
  security group

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2048979/+subscriptions

