[Yahoo-eng-team] [Bug 2083226] [NEW] [scale] Adding a public external network to a router is killing database
Public bug reported:

Context
===
OpenStack Bobcat (but master seems affected by this as well).
OVS-based deployment.
L3 routers in DVR and HA mode.
One big public "external/public" network (with subnets like /21 or /22) used by instances and router external gateways.

Problem description
===
When adding a port on a router in HA+DVR, the neutron API may send a lot of RPC messages toward the L3 agents, depending on the size of the subnet used for the gateway.

How to reproduce
===
Add a port on a router:

$ openstack port create --network public pub
$ openstack router add port router-arnaud pub

On the neutron server, in the logs (in DEBUG):

  Notify agent at l3_agent.hostxyz

We see this line for every L3 agent having a port in the public network/subnet (which can be huge, like 1k agents). Then, all of those agents do another RPC call (sync_routers), which ends up on neutron-rpc with this log line:

  Sync routers for ids [abc]

Behind the sync_routers call, some big SQL requests are executed (e.g. in l3_dvrscheduler_db.py / _get_dvr_subnet_ids_on_host_query). When 1k requests like this are done for each router update, the database is killed by the amount of SQL it has to run. The DVR router is then configured by the L3 agent on all the computes, but it is never used (the public network is an external one and does not rely on routers to be accessible).
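For scale testing, the fan-out can be quantified straight from the neutron-server debug log by counting the "Notify agent at ..." lines emitted for a single router update. A minimal sketch, assuming the log path and the exact message format quoted above:

```
# Count how many distinct L3 agents are notified, based on the
# "Notify agent at <host>" debug lines quoted in this report.
# The log path is an assumption; adjust it to the deployment.
import collections
import re

NOTIFY_RE = re.compile(r"Notify agent at (?P<host>\S+)")
LOG_PATH = "/var/log/neutron/neutron-server.log"  # assumed location

hosts = collections.Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = NOTIFY_RE.search(line)
        if match:
            hosts[match.group("host")] += 1

print(f"{len(hosts)} distinct L3 agents notified, "
      f"{sum(hosts.values())} notifications in total")
```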
We have two options:
- prevent adding a port from an external network to a router (it should be used only for router gateways), or
- stop flooding the creation of DVR routers in such a situation.

Note: this is pretty much the same scenario as the one described in #1992950.

** Affects: neutron
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2083226
[Yahoo-eng-team] [Bug 2083227] [NEW] [neutron-lib] pep8 job failing with pylint=3.3.1
Public bug reported:

pylint==3.3.1 was released on Sep 24, 2024 [1]. Prior to this, neutron-lib pep8 was using pylint==3.2.7.

Current output (failing checks):

* Module neutron_lib.context
  neutron_lib/context.py:36:4: R0917: Too many positional arguments (8/5) (too-many-positional-arguments)
* Module neutron_lib.placement.client
  neutron_lib/placement/client.py:337:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
* Module neutron_lib.callbacks.manager
  neutron_lib/callbacks/manager.py:36:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
* Module neutron_lib.callbacks.events
  neutron_lib/callbacks/events.py:73:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
  neutron_lib/callbacks/events.py:114:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
  neutron_lib/callbacks/events.py:154:4: R0917: Too many positional arguments (9/5) (too-many-positional-arguments)
* Module neutron_lib.db.model_query
  neutron_lib/db/model_query.py:74:0: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
  neutron_lib/db/model_query.py:302:0: R0917: Too many positional arguments (9/5) (too-many-positional-arguments)
  neutron_lib/db/model_query.py:350:0: R0917: Too many positional arguments (10/5) (too-many-positional-arguments)
* Module neutron_lib.services.qos.base
  neutron_lib/services/qos/base.py:30:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
* Module neutron_lib.agent.linux.interface
  neutron_lib/agent/linux/interface.py:24:4: R0917: Too many positional arguments (10/5) (too-many-positional-arguments)

[1] https://pypi.org/project/pylint/#history
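For context, R0917 flags callables that take more than five positional parameters (the pylint default). A hypothetical example, not taken from neutron-lib, of what trips the check and one way to satisfy it; disabling the recommendation in the pylint configuration is the other option:

```
# Hypothetical sketch: six positional parameters trigger R0917
# (too-many-positional-arguments, default max of 5) under pylint>=3.3.0.
def create_port(context, network_id, mac, fixed_ips, security_groups, name):
    return (context, network_id, mac, fixed_ips, security_groups, name)


# Making most parameters keyword-only keeps the positional count at or
# below the limit, so the check passes without any pylint override.
def create_port_kwonly(context, network_id, *, mac=None, fixed_ips=None,
                       security_groups=None, name=""):
    return (context, network_id, mac, fixed_ips, security_groups, name)
```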
** Affects: neutron
     Importance: High
     Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
         Status: New

** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

** Changed in: neutron
   Importance: Undecided => High

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2083227
[Yahoo-eng-team] [Bug 2083214] [NEW] [RFE] control random-fully behavior on a per-FIP base
Public bug reported:

As of this moment, Neutron uses random-fully[1] PAT when performing NAT on the L3 agent, meaning that the source port is randomized for every outgoing connection. This breaks some workflows for software that performs UDP hole punching, such as Nebula[2] for example. There are other use cases where knowing the post-NAT source port value for an outgoing connection is desirable. Neutron currently provides a `use_random_fully` setting[3] that controls the use of random-fully PAT, but it is global and affects the cloud as a whole.

My proposal is to implement control over the random-fully setting on a per-Floating-IP basis. I have already implemented this in a Devstack environment. My change required updates in neutron (L3 agent code, database schema update: one additional boolean column in the floatingip table), neutron-lib (API support, introduction of a new validator type, etc.) and openstackclient (CLI support).

In this implementation a new 'random_fully' setting belonging to a FIP can take 3 values: True, False, or None.

- If True (API JSON: {"floatingip": {"random_fully": true}}), random-fully is always enabled on the FIP, disregarding the global `use_random_fully` setting.
- If False (API JSON: {"floatingip": {"random_fully": false}}), random-fully is always disabled on the FIP, disregarding the global `use_random_fully` setting.
- If None (API JSON: {"floatingip": {"random_fully": null}}), the random-fully mode is inherited from the global `use_random_fully` setting.

It works pretty much as expected: the L3 agent updates the iptables rules after the API call. I'll be glad to share that code to expedite this feature implementation.

Short example output from a Devstack environment:

```
stack@vlab007:~/neutron$ openstack floating ip list --long -c ID -c 'Floating IP Address' -c 'Fixed IP Address' -c Port -c Router -c Status -c Description -c 'Random Fully'
| ID                                   | Floating IP Address | Fixed IP Address | Port                                 | Router                               | Status | Description          | Random Fully |
| 0d97ed4c-15ae-4d01-a69c-ffd14e46ead0 | 172.24.4.11         | 10.0.0.21        | b5b29b90-350c-4d4e-8e27-35e76e9b8204 | 90364e18-a104-49b0-bbb5-41a516ea9bd2 | ACTIVE | My FIP description 4 | None         |
| 387fdc61-d386-4917-bd82-23055ebca273 | 172.24.4.207        | 10.0.0.39        | 64413e38-d611-461d-b1e5-20e38d3795dd | 90364e18-a104-49b0-bbb5-41a516ea9bd2 | ACTIVE |                      | None         |
| b47db56a-f944-43c2-ab16-271d3d809e20 | 172.24.4.231        | 10.0.0.19        | 97acadf3-7ed2-4dee-8e9c-db3b359c2319 | 90364e18-a104-49b0-bbb5-41a516ea9bd2 | ACTIVE |                      | False        |

ubuntu@vlab007:~$ sudo ip netns exec qrouter-90364e18-a104-49b0-bbb5-41a516ea9bd2 iptables-legacy-save -t nat | grep "neutron-l3-agent-float-snat -s"
-A neutron-l3-agent-float-snat -s 10.0.0.21/32 -j SNAT --to-source 172.24.4.11 --random-fully
-A neutron-l3-agent-float-snat -s 10.0.0.39/32 -j SNAT --to-source 172.24.4.207 --random-fully
-A neutron-l3-agent-float-snat -s 10.0.0.19/32 -j SNAT --to-source 172.24.4.231

stack@vlab007:~/neutron$ openstack floating ip set --disable-random-fully 387fdc61-d386-4917-bd82-23055ebca273

stack@vlab007:~/neutron$ openstack floating ip list --long -c ID -c 'Floating IP Address' -c 'Fixed IP Address' -c Port -c Router -c Status -c Description -c 'Random Fully'
| ID                                   | Floating IP Address | Fixed IP Address | Port                                 | Router                               | Status | Description          | Random Fully |
| 0d97ed4c-15ae-4d01-a69c-ffd14e46ead0 | 172.24.4.11         | 10.0.0.21        | b5b29b90-350c-4d4e-8e27-35e76e9b8204 | 90364e18-a104-49b0-bbb5-41a516ea9bd2 | ACTIVE | My FIP description 4 | None         |
| 387fdc61-d386-4917-bd82-23055ebca273 | 172.24.4.207        | 10.0.0.39        | 64413e38-d611-461d-b1e5-20e38d3795dd | 90364e18-a104
```
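A client-side sketch of driving the proposed attribute through the Neutron API, assuming the extension described above is present; the endpoint, token and FIP UUID below are placeholders, and the JSON payload follows the API JSON quoted in the proposal:

```
# Sketch only: per-FIP "random_fully" is the proposed attribute, not an
# existing Neutron API field. Endpoint, token and FIP ID are placeholders.
import requests

NEUTRON_URL = "http://controller:9696/v2.0"        # placeholder endpoint
TOKEN = "gAAAAAB..."                                # placeholder keystone token
FIP_ID = "387fdc61-d386-4917-bd82-23055ebca273"     # FIP from the example above

resp = requests.put(
    f"{NEUTRON_URL}/floatingips/{FIP_ID}",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
    # null inherits the global use_random_fully; true/false override it per FIP
    json={"floatingip": {"random_fully": False}},
)
resp.raise_for_status()
print(resp.json()["floatingip"].get("random_fully"))
```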
[Yahoo-eng-team] [Bug 2083246] [NEW] nova-compute can be overloaded with incoming evacuations
Public bug reported:

Today the nova-compute service does not limit the number of concurrent evacuation requests. The [compute]max_concurrent_build config is only considered for new VM builds, not for re-builds due to incoming evacuations. If the evacuated VMs are on shared storage, there are no heavy IO operations (no image download / convert), so the [compute]max_concurrent_disk_ops config cannot prevent the overload either.

At some point nova-compute will start failing to process incoming vif-plugged events in a timely manner. (In a specific env it happened after more than 60 concurrent evacuation requests targeting the same node.)

The [compute]max_concurrent_build config description does not explicitly state that it only counts new build requests and ignores rebuilds. So I consider it a bug in nova-compute that it is not limiting all builds by that config option. As this config option defaults to 10, nova never really planned to support significantly more than 10 concurrent builds, so failing at 60 concurrent evacuations does not need to be supported.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute evacuate

** Tags added: evacuate

** Tags added: compute

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2083246
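For illustration only (this is not nova's actual implementation), a sketch of how a single semaphore sized from a limit like [compute]max_concurrent_build would throttle both new builds and evacuation-driven rebuilds if they shared it, which is the behaviour the report above asks for:

```
# Generic illustration of the shared-limit idea described in the report.
import threading

MAX_CONCURRENT_BUILD = 10                    # mirrors the default mentioned above
_build_semaphore = threading.Semaphore(MAX_CONCURRENT_BUILD)


def spawn_instance(instance_id: str, rebuild: bool = False) -> None:
    # If rebuilds/evacuations bypass this semaphore, nothing caps how many
    # run at once, which is the overload scenario described in the report.
    with _build_semaphore:
        kind = "rebuild (evacuation)" if rebuild else "new build"
        print(f"building {instance_id} as {kind}")
```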
[Yahoo-eng-team] [Bug 1999814] Re: [SRU] Allow for specifying common baseline CPU model with disabled feature
This bug was fixed in the package nova - 3:25.2.1-0ubuntu2.7~cloud0

---
nova (3:25.2.1-0ubuntu2.7~cloud0) focal-yoga; urgency=medium

  * New update for the Ubuntu Cloud Archive.

nova (3:25.2.1-0ubuntu2.7) jammy; urgency=medium

  [ Chengen Du ]
  * d/p/lp2024258-database-Archive-parent-and-child-rows-trees-one-at-.patch:
    Performance degradation archiving DB with large numbers of FK related
    records (LP: #2024258)

  [ Rodrigo Barbieri ]
  * d/p/lp1999814.patch: Rework CPU comparison at startup and add ability
    to skip it. Addresses CascadeLake incompatibility with IceLake.
    (LP: #1999814)

** Changed in: cloud-archive/yoga
       Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1999814

Title:
  [SRU] Allow for specifying common baseline CPU model with disabled feature

Status in Ubuntu Cloud Archive: Invalid
Status in Ubuntu Cloud Archive ussuri series: New
Status in Ubuntu Cloud Archive yoga series: Fix Released
Status in OpenStack Compute (nova): Expired
Status in OpenStack Compute (nova) ussuri series: New
Status in OpenStack Compute (nova) victoria series: Won't Fix
Status in OpenStack Compute (nova) wallaby series: Won't Fix
Status in OpenStack Compute (nova) xena series: Won't Fix
Status in OpenStack Compute (nova) yoga series: New
Status in nova package in Ubuntu: Fix Released
Status in nova source package in Bionic: Won't Fix
Status in nova source package in Focal: Fix Released
Status in nova source package in Jammy: Fix Released

Bug description:
  SRU TEMPLATE AT THE BOTTOM ***

  Hello,

  This is very similar to pad.lv/1852437 (and the related blueprint at
  https://blueprints.launchpad.net/nova/+spec/allow-disabling-cpu-flags),
  but there is a very different and important nuance.

  A customer I'm working with has two classes of blades that they're
  trying to use. Their existing ones are Cascade Lake-based; they are
  presently using the Cascadelake-Server-noTSX CPU model via
  libvirt.cpu_model in nova.conf. Their new blades are Ice Lake-based,
  which is a newer processor, which typically would also be able to run
  based on the Cascade Lake feature set - except that these Ice Lake
  processors lack the MPX feature defined in the Cascadelake-Server-noTSX
  model.

  The result of this is evident when I try to start nova on the new
  blades with the Ice Lake CPUs. Even if I specify the following in my
  nova.conf:

  [libvirt]
  cpu_mode = custom
  cpu_model = Cascadelake-Server-noTSX
  cpu_model_extra_flags = -mpx

  That is not enough to allow Nova to start; it fails in the libvirt
  driver in the _check_cpu_compatibility function:

  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Traceback (most recent call last):
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 771, in _check_cpu_compatibility
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self._compare_cpu(cpu, self._get_cpu_info(), None)
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8817, in _compare_cpu
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service 0
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service During handling of the above exception, another exception occurred:
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Traceback (most recent call last):
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 810, in run_service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     service.start()
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/service.py", line 173, in start
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self.manager.init_host()
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1404, in init_host
  2022-12-15 17:20:59.562 183670
[Yahoo-eng-team] [Bug 2024258] Re: Performance degradation archiving DB with large numbers of FK related records
This bug was fixed in the package nova - 3:25.2.1-0ubuntu2.7~cloud0

---
nova (3:25.2.1-0ubuntu2.7~cloud0) focal-yoga; urgency=medium

  * New update for the Ubuntu Cloud Archive.

nova (3:25.2.1-0ubuntu2.7) jammy; urgency=medium

  [ Chengen Du ]
  * d/p/lp2024258-database-Archive-parent-and-child-rows-trees-one-at-.patch:
    Performance degradation archiving DB with large numbers of FK related
    records (LP: #2024258)

  [ Rodrigo Barbieri ]
  * d/p/lp1999814.patch: Rework CPU comparison at startup and add ability
    to skip it. Addresses CascadeLake incompatibility with IceLake.
    (LP: #1999814)

** Changed in: cloud-archive/yoga
       Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2024258

Title:
  Performance degradation archiving DB with large numbers of FK related records

Status in Ubuntu Cloud Archive: Invalid
Status in Ubuntu Cloud Archive ussuri series: Fix Committed
Status in Ubuntu Cloud Archive yoga series: Fix Released
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) antelope series: In Progress
Status in OpenStack Compute (nova) wallaby series: In Progress
Status in OpenStack Compute (nova) xena series: In Progress
Status in OpenStack Compute (nova) yoga series: In Progress
Status in OpenStack Compute (nova) zed series: In Progress
Status in nova package in Ubuntu: Won't Fix
Status in nova source package in Focal: Fix Released
Status in nova source package in Jammy: Fix Released

Bug description:
  [Impact]

  Originally, Nova archives deleted rows in batches consisting of a maximum number of parent rows (max_rows) plus their child rows, all within a single database transaction. This approach limits the maximum value of max_rows that can be specified by the caller due to the potential size of the database transaction it could generate. Additionally, this behavior can cause the cleanup process to frequently encounter the following error:

  oslo_db.exception.DBError: (pymysql.err.InternalError) (3100, "Error on observer while running replication hook 'before_commit'.")

  The error arises when the transaction exceeds the group replication transaction size limit, a safeguard implemented to prevent potential MySQL crashes [1]. The default value for this limit is approximately 143MB.

  [Fix]

  An upstream commit has changed the logic to archive one parent row and its related child rows in a single database transaction. This change allows operators to choose more predictable values for max_rows and achieve more progress with each invocation of archive_deleted_rows. Additionally, this commit reduces the chances of encountering the issue where the transaction size exceeds the group replication transaction size limit.

  commit 697fa3c000696da559e52b664c04cbd8d261c037
  Author: melanie witt
  CommitDate: Tue Jun 20 20:04:46 2023 +

      database: Archive parent and child rows "trees" one at a time

  [Test Plan]

  1. Create an instance and delete it in OpenStack.
  2. Log in to the Nova database and confirm that there is an entry with a deleted_at value that is not NULL.
     select display_name, deleted_at from instances where deleted_at <> 0;
  3. Execute the following command, ensuring that the timestamp specified in --before is later than the deleted_at value:
     nova-manage db archive_deleted_rows --before "XXX-XX-XX XX:XX:XX" --verbose --until-complete
  4. Log in to the Nova database again and confirm that the entry has been archived and removed.
     select display_name, deleted_at from instances where deleted_at <> 0;

  [Where problems could occur]

  The commit changes the logic for archiving deleted entries to reduce the size of transactions generated during the operation. If the patch contains errors, it will only impact the archiving of deleted entries and will not affect other functionalities.

  [1] https://bugs.mysql.com/bug.php?id=84785

  [Original Bug Description]

  Observed downstream in a large scale cluster with constant create/delete server activity and hundreds of thousands of deleted instances rows.

  Currently, we archive deleted rows in batches of max_rows parents + their child rows in a single database transaction. Doing it that way limits how high a value of max_rows can be specified by the caller because of the size of the database transaction it could generate. For example, in a large scale deployment with hundreds of thousands of deleted rows and constant server creation and deletion activity, a value of max_rows=1000 might exceed the database's configured maximum packet size or timeout due to a database deadlock, forcing the operator to use a much lower max_rows value like 100 or 50. And when the operator has e.g. 500,000 deleted instances rows (and millions of deleted rows total) they are trying to archive, being forced to use a max_rows value several orders of magnitude lower than the number of rows they need to archive is a poor user experience and makes it unclear if archive progress is actually being made.
[Yahoo-eng-team] [Bug 2083237] [NEW] Initial router state is not set correctly
Public bug reported:

Context
===
OpenStack Antelope (but master seems affected).

When a router is created in HA mode, multiple L3 agents (3 by default) spawn a keepalived process to monitor the state of the router. The initial state of the router is supposed to be saved in the 'initial_state' variable when a call to the initial_state_change() function is done. This initial_state is kept so that it prevents false bounces while keepalived is transitioning.

Problem
===
The initial_state is set only when the state of the router is primary. So in a scenario with 3 L3 agents, we could have:

t0:
  agent-1 initial state: primary
  agent-2 initial state: (unset)
  agent-3 initial state: (unset)

t1:
  agent-1: failure
  agent-2: transition to primary
  agent-3: transition to primary

Both agent-2 and agent-3 are transitioning to primary, and neutron will send a port binding update to the server for each of them. The last one sending the request will win the binding. Let's imagine the binding is now on agent-3.

t2:
  agent-1: failure
  agent-2: primary
  agent-3: transition to backup

agent-2 wins and stays primary; agent-3 transitions to backup. So now we have the port binding recorded to be on agent-3, but agent-2 is actually primary.

Solution
===
Neutron code is supposed to handle false bounces by setting the initial state correctly. The code will then sleep (eventlet.sleep(self.conf.ha_vrrp_advert_int)) until the keepalived state has stabilized, so only one agent will grab the binding. To make sure this code works, the initial state needs to be set correctly from the beginning.

** Affects: neutron
     Importance: Undecided
     Assignee: Arnaud Morin (arnaud-morin)
         Status: In Progress

** Changed in: neutron
     Assignee: (unassigned) => Arnaud Morin (arnaud-morin)
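A simplified illustration of the behaviour requested above, using hypothetical names rather than the actual neutron L3 agent code: record whatever state is observed first (primary or backup), then wait one advertisement interval before reporting a transition, so only one agent ends up claiming primary:

```
# Hypothetical sketch, not the neutron L3 agent implementation.
import time

HA_VRRP_ADVERT_INT = 2  # seconds; stands in for cfg option ha_vrrp_advert_int


class RouterStateTracker:
    def __init__(self):
        self.initial_state = None

    def initial_state_change(self, state):
        # Record the first observed state unconditionally, instead of only
        # when the router is primary.
        if self.initial_state is None:
            self.initial_state = state

    def handle_transition(self, new_state, notify):
        # Wait one advertisement interval so keepalived can settle; only one
        # agent should then report primary and grab the port binding.
        time.sleep(HA_VRRP_ADVERT_INT)
        if new_state != self.initial_state:
            notify(new_state)
        self.initial_state = new_state
```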
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2083237
[Yahoo-eng-team] [Bug 2024258] Re: Performance degradation archiving DB with large numbers of FK related records
** Changed in: nova/zed
       Status: In Progress => Won't Fix

** Changed in: nova/yoga
       Status: In Progress => Won't Fix

** Changed in: nova/xena
       Status: In Progress => Won't Fix

** Changed in: nova/wallaby
       Status: In Progress => Won't Fix

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2024258

Title:
  Performance degradation archiving DB with large numbers of FK related records

Status in Ubuntu Cloud Archive: Invalid
Status in Ubuntu Cloud Archive ussuri series: Fix Committed
Status in Ubuntu Cloud Archive yoga series: Fix Released
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) antelope series: In Progress
Status in OpenStack Compute (nova) wallaby series: Won't Fix
Status in OpenStack Compute (nova) xena series: Won't Fix
Status in OpenStack Compute (nova) yoga series: Won't Fix
Status in OpenStack Compute (nova) zed series: Won't Fix
Status in nova package in Ubuntu: Won't Fix
Status in nova source package in Focal: Fix Released
Status in nova source package in Jammy: Fix Released
[Yahoo-eng-team] [Bug 2083250] [NEW] This handler is supposed to handle AFTER events, as in 'AFTER it's committed', not BEFORE. Offending resource event: port, after_delete.
Public bug reported:

Log: https://zuul.opendev.org/t/openstack/build/93dc7212664a4eb493adb39270bda463
Snippet: https://paste.opendev.org/show/b5tC32IFLb9Bk9RgK4Y9/

** Affects: neutron
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2083250
[Yahoo-eng-team] [Bug 2083227] Re: [neutron-lib] pep8 job failing with pylint=3.3.1
Reviewed:  https://review.opendev.org/c/openstack/neutron-lib/+/930886
Committed: https://opendev.org/openstack/neutron-lib/commit/939839cb838db12c589f6fe54bc5907cb6303590
Submitter: "Zuul (22348)"
Branch:    master

commit 939839cb838db12c589f6fe54bc5907cb6303590
Author: Rodolfo Alonso Hernandez
Date: Mon Sep 30 09:57:24 2024 +

    Skip pylint recommendation "too-many-positional-arguments"

    This warning was introduced in [1] as is present in pytlint==3.3.0

    [1]https://github.com/pylint-dev/pylint/commit/de6e6fae34cccd2e7587a46450c833258e3000cb

    Closes-Bug: #2083227
    Change-Id: I124d5ff7d34dd868dd2861b72e55d62190dcc3f7

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2083227

Title:
  [neutron-lib] pep8 job failing with pylint=3.3.1

Status in neutron: Fix Released
[Yahoo-eng-team] [Bug 2081859] Re: Nova not initializing os-brick
Reviewed:  https://review.opendev.org/c/openstack/nova/+/849328
Committed: https://opendev.org/openstack/nova/commit/8c1a47c9cf6e1001fbefd6ff3b76314e39c81d71
Submitter: "Zuul (22348)"
Branch:    master

commit 8c1a47c9cf6e1001fbefd6ff3b76314e39c81d71
Author: Gorka Eguileor
Date: Thu Jul 7 16:22:42 2022 +0200

    Support os-brick specific lock_path

    Note: Initially this patch was related to new feature, but now it has
    become a bug since os-brick's `setup` method is not being called and it
    can create problems if os-brick changes.

    As a new feature, os-brick now supports setting the location of file
    locks in a different location from the locks of the service.

    The functionality is intended for HCI deployments and hosts that are
    running Cinder and Glance using Cinder backend. In those scenarios the
    service can use a service specific location for its file locks while
    only sharing the location of os-brick with the other services.

    To leverage this functionality the new os-brick code is needed and
    method ``os_brick.setup`` needs to be called once the service
    configuration options have been loaded.

    The default value of the os-brick ``lock_path`` is the one set in
    ``oslo_concurrency``.

    This patch adds support for this new feature in a non backward
    compatible way, so it requires an os-brick version bump in the
    requirements.

    The patch also ensures that ``tox -egenconfig`` includes the os-brick
    configuration options when generating the sample config.

    Closes-Bug: #2081859
    Change-Id: I1b81eb65bd145869e8cf6f3aabc6ade58f832a19

** Changed in: nova
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2081859

Title:
  Nova not initializing os-brick

Status in OpenStack Compute (nova): Fix Released

Bug description:
  In the Zed release os-brick started needing to be initialized by calling a `setup` method before the library could be used. At that time there was only 1 feature that depended on it and it was possible to introduce a failsafe for that instance so things wouldn't break.

  In the Antelope release that failsafe should have been removed from os-brick and all projects should have been calling the `setup` method.

  Currently nova is not initializing os-brick, so if os-brick removes the failsafe the behavior in os-brick locks will break backward compatibility.

  Related os-brick patch: https://review.opendev.org/c/openstack/os-brick/+/849324
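A minimal sketch of the initialization order the commit message describes: load the service configuration first, then call os_brick.setup. Whether setup takes the ConfigOpts object as shown here is an assumption based on the commit message, not a verified signature:

```
# Sketch only; the os_brick.setup() signature is an assumption.
import os_brick
from oslo_config import cfg

CONF = cfg.CONF


def main():
    # 1. Load the service configuration (files, CLI args, defaults) first.
    CONF([], project='nova')
    # 2. Then initialize os-brick so it can honour its own lock_path
    #    (falling back to the oslo_concurrency lock_path by default).
    os_brick.setup(CONF)


if __name__ == '__main__':
    main()
```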
[Yahoo-eng-team] [Bug 2081643] Re: Neutron OVN support for CARP
I am going to close this as any discussion would first have to happen on the OVN mailing list. If after that there is any change required in Neutron, you can open an RFE for that.

** Changed in: neutron
       Status: New => Invalid

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2081643

Title:
  Neutron OVN support for CARP

Status in neutron: Invalid

Bug description:
  Does Neutron ML2/OVN support CARP as a virtual IP synchronization protocol (without disabling port security)?

  I've been trying to make it work, but from what I managed to understand: CARP uses a MAC address with the following format, 00-00-5E-00-01-{VRID}. It answers ARP requests for the virtual addresses with the source MAC of the main interface, but as arp.sha it uses the MAC address mentioned above, and from what I could read in the OVN source code it doesn't seem like OVN matches any ARP response whose eth.source and arp.sha fields differ.

  Link to the OVN code:
  https://github.com/ovn-org/ovn/blob/16836c3796f7af68437f9f834b40d87c801dc27c/controller/lflow.c#L2707

  https://datatracker.ietf.org/doc/html/rfc5798#section-7.3
  https://datatracker.ietf.org/doc/html/rfc5798#section-8.1.2
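For reference, a small illustration of the CARP virtual MAC format mentioned in the report above (00-00-5E-00-01-{VRID}, per RFC 5798), which is the address CARP places in arp.sha while eth.src stays the MAC of the main interface:

```
# Helper that formats the CARP/VRRP virtual router MAC for a given VRID.
def carp_virtual_mac(vrid: int) -> str:
    if not 0 <= vrid <= 255:
        raise ValueError("VRID must fit in one octet")
    return f"00:00:5e:00:01:{vrid:02x}"


# Example: the MAC advertised in arp.sha for VRID 17; the mismatch with
# eth.src is what the report says OVN refuses to match.
print(carp_virtual_mac(17))  # 00:00:5e:00:01:11
```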
[Yahoo-eng-team] [Bug 2083287] [NEW] test_add_router_interfaces_on_overlapping_subnets_returns_400 failed when retrying router interface removal after API timeout; (pymysql.err.OperationalError) (1205
Public bug reported:

Failure here:

ft1.1: tempest.api.network.test_routers_negative.RoutersNegativeIpV6Test.test_add_router_interfaces_on_overlapping_subnets_returns_400[id-957751a3-3c68-4fa2-93b6-eb52ea10db6e,negative]
testtools.testresult.real._StringException: pythonlogging:'': {{{

2024-09-30 18:50:09,236 75502 INFO     [tempest.lib.common.rest_client] Request (RoutersNegativeIpV6Test:test_add_router_interfaces_on_overlapping_subnets_returns_400): 201 POST https://[2001:41d0:302:1000::9e7]/networking/v2.0/networks 0.570s
2024-09-30 18:50:09,236 75502 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': ''}
        Body: {"network": {"name": "tempest-router-network01--1466802867"}}
    Response - Headers: {'date': 'Mon, 30 Sep 2024 18:50:08 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'content-type': 'application/json', 'content-length': '573', 'x-openstack-request-id': 'req-289396ba-afdc-4dd9-8538-9a4213985b21', 'connection': 'close', 'status': '201', 'content-location': 'https://[2001:41d0:302:1000::9e7]/networking/v2.0/networks'}
        Body: b'{"network":{"id":"4a3d5446-c98b-4327-a0d3-62c2be199194","name":"tempest-router-network01--1466802867","tenant_id":"17e56b8c51624d2e910dea6995e4003f","admin_state_up":true,"mtu":1352,"status":"ACTIVE","subnets":[],"shared":false,"project_id":"17e56b8c51624d2e910dea6995e4003f","port_security_enabled":true,"router:external":false,"is_default":false,"availability_zone_hints":[],"availability_zones":[],"ipv4_address_scope":null,"ipv6_address_scope":null,"description":"","tags":[],"created_at":"2024-09-30T18:50:08Z","updated_at":"2024-09-30T18:50:08Z","revision_number":1}}'
2024-09-30 18:50:09,844 75502 INFO     [tempest.lib.common.rest_client] Request (RoutersNegativeIpV6Test:test_add_router_interfaces_on_overlapping_subnets_returns_400): 201 POST https://[2001:41d0:302:1000::9e7]/networking/v2.0/networks 0.607s
2024-09-30 18:50:09,845 75502 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': ''}
        Body: {"network": {"name": "tempest-router-network02--2017064160"}}
    Response - Headers: {'date': 'Mon, 30 Sep 2024 18:50:09 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'content-type': 'application/json', 'content-length': '573', 'x-openstack-request-id': 'req-73800b2e-ef0a-4591-9f41-1d41c0766d63', 'connection': 'close', 'status': '201', 'content-location': 'https://[2001:41d0:302:1000::9e7]/networking/v2.0/networks'}
        Body: b'{"network":{"id":"a2a0b2fb-5643-492e-937a-795142d160a0","name":"tempest-router-network02--2017064160","tenant_id":"17e56b8c51624d2e910dea6995e4003f","admin_state_up":true,"mtu":1352,"status":"ACTIVE","subnets":[],"shared":false,"project_id":"17e56b8c51624d2e910dea6995e4003f","port_security_enabled":true,"router:external":false,"is_default":false,"availability_zone_hints":[],"availability_zones":[],"ipv4_address_scope":null,"ipv6_address_scope":null,"description":"","tags":[],"created_at":"2024-09-30T18:50:09Z","updated_at":"2024-09-30T18:50:09Z","revision_number":1}}'
2024-09-30 18:50:10,214 75502 INFO     [tempest.lib.common.rest_client] Request (RoutersNegativeIpV6Test:test_add_router_interfaces_on_overlapping_subnets_returns_400): 201 POST https://[2001:41d0:302:1000::9e7]/networking/v2.0/subnets 0.368s
2024-09-30 18:50:10,214 75502 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': ''}
        Body: {"subnet": {"network_id": "4a3d5446-c98b-4327-a0d3-62c2be199194", "cidr": "2001:db8::/64", "ip_version": 6, "gateway_ip": "2001:db8::1"}}
    Response - Headers: {'date': 'Mon, 30 Sep 2024 18:50:09 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'content-type': 'application/json', 'content-length': '646', 'x-openstack-request-id': 'req-15bb4af0-20fd-4264-acaf-5a42ac72e5e7', 'connection': 'close', 'status': '201', 'content-location': 'https://[2001:41d0:302:1000::9e7]/networking/v2.0/subnets'}
        Body: b'{"subnet":{"id":"e431775b-293e-45fd-90b9-2453400019ce","name":"","tenant_id":"17e56b8c51624d2e910dea6995e4003f","network_id":"4a3d5446-c98b-4327-a0d3-62c2be199194","ip_version":6,"subnetpool_id":null,"enable_dhcp":true,"ipv6_ra_mode":null,"ipv6_address_mode":null,"gateway_ip":"2001:db8::1","cidr":"2001:db8::/64","allocation_pools":[{"start":"2001:db8::2","end":"2001:db8:::::"}],"host_routes":[],"dns_nameservers":[],"description":"","router:external":false,"service_types":[],"tags":[],"created_at":"2024-09-30T18:50:10Z","updated_at":"2024-09-30T18:50:10Z","revision_number":0,"project_id":"17e56b8c51624d2e910dea6995e4003f"}}'
2024-09-30 18:50:10,522 75502 INFO     [tempest.lib.common.rest_client] Request (RoutersNegativeIpV6Test:test_add_router_interfaces_on_overlapping_subnets_returns_400): 201 POST https://[2001: