[Yahoo-eng-team] [Bug 1446716] Re: Scheduler weighers do not output individual weights to debug log

2024-11-22 Thread Pavel Mracek
I think this issue was resolved by this commit

commit 154ab7b2f9ad80fe432d2c036d5e8c4ee171897b
Author: Balazs Gibizer 
Date:   Thu Nov 11 19:06:56 2021 +0100

Add debug log for scheduler weight calculation

We have all the weighers enabled by default and each can have its own
multiplier making the final compute node order calculation pretty
complex. This patch adds some debug logging that helps understanding how
the final ordering was reached.

Change-Id: I7606d6eb3e08548c1df9dc245ab39cced7de1fb5


** Changed in: nova
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1446716

Title:
  Scheduler weighers do not output individual weights to debug log

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In the nova scheduler (and nova-cells) the filters report debug level
  logging on what decisions each filter makes on the hosts (cells). The
  weighers, however, only give the combined weight after all have been
  applied. As an operator, I want to be able to see the weight each
  weigher contributes in the debug log so that I can troubleshoot
  scheduler problems better.
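
  For reference, the final host order is essentially a multiplier-scaled sum
  of the per-weigher weights. A rough sketch of that calculation follows (a
  simplified model, not nova's actual code; the weigher names and the
  normalization are illustrative):

    # Simplified model of how a combined host weight is built up.
    def combined_weight(per_weigher_weights, multipliers):
        """per_weigher_weights: e.g. {'ram': 0.8, 'cpu': 0.5}, normalized to [0, 1].
        multipliers: e.g. {'ram': 1.0, 'cpu': 2.0}, taken from configuration."""
        total = 0.0
        for name, weight in per_weigher_weights.items():
            contribution = multipliers.get(name, 1.0) * weight
            # The per-weigher contribution is exactly what this bug asks to
            # see in the debug log.
            print(f"weigher={name} weight={weight} "
                  f"multiplier={multipliers.get(name, 1.0)} contribution={contribution}")
            total += contribution
        return total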

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1446716/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2030825] Re: s3 backend fails with invalid certificate when using s3 compatible storage

2024-11-22 Thread Abhishek Kekane
** Project changed: glance => glance-store

** Changed in: glance-store
 Assignee: (unassigned) => Cyril Roelandt (cyril-roelandt)

** Changed in: glance-store
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2030825

Title:
  s3 backend fails with invalid certificate when using s3 compatible
  storage

Status in glance_store:
  Fix Committed

Bug description:
  When using the Glance s3 backend, if you are using an s3 compatible
  store, image operations fail with:

  [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-
  signed certificate in certificate chain (_ssl.c:1129).

  The current implementation uses boto3 and assumes you are only using
  Amazon's implementation as there are not currently any settings for
  overriding the CA. In my case, we are using an s3 compatible on-prem
  device which has internal corporate certs. If I override using an
  environment variable of AWS_CA_BUNDLE to my CA bundle, the s3 backend
  then works great.

  Can we add an option to the s3_backend configuration so that we can
  specify the location of a CA bundle and override the default CA? It
  appears a few of the other store options already have this functionality,
  so we would need to add support for it to the boto3-based driver.

  This was tested in Antelope and validated to work once the environment
  variable was added.
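
  As a hedged illustration of what such an option could map to (the
  's3_store_cacert' name below is hypothetical, not an existing glance_store
  setting), boto3 already accepts a CA bundle via its 'verify' argument:

    import boto3

    # Would come from a new glance_store option such as the hypothetical
    # 's3_store_cacert'; today the AWS_CA_BUNDLE environment variable is the
    # only workaround.
    ca_bundle = "/etc/glance/s3-ca-bundle.pem"

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.internal",  # s3-compatible on-prem endpoint
        verify=ca_bundle,  # path to the CA bundle used for TLS verification
    )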

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance-store/+bug/2030825/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2075529] Re: Unable to delete "access_as_shared" RBAC policy

2024-11-22 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935278
Committed: https://opendev.org/openstack/neutron/commit/90d836bc420ccd309196ece7908b41b9e2c4f766
Submitter: "Zuul (22348)"
Branch: master

commit 90d836bc420ccd309196ece7908b41b9e2c4f766
Author: Rodolfo Alonso Hernandez 
Date:   Fri Nov 15 11:08:19 2024 +

Filter out the floating IPs when removing a shared RBAC

When a RBAC with action=access_as_shared is removed from a network, it
is checked first that there are no elements (ports) in this network
that could no longer exist due to the RBAC permissions reduction.

The floating IP related ports, that have project_id='' by definition,
should be removed from this check. These ports can be created due to
a RBAC with action=access_as_external. If a floating IP port is present
in the network, it should not block the RBAC with
action=access_as_shared removal.

Closes-Bug: #2075529
Change-Id: I7e31c21c04dc1ef26f5f05537ca0d2cb8f5ca505
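
A rough sketch of the check described above (illustrative only, not the
actual neutron code; the names are hypothetical):

    def ports_blocking_shared_rbac_removal(ports):
        """Return the ports that should block removal of an access_as_shared RBAC.

        Floating IP ports have project_id == '' and exist because of an
        access_as_external RBAC, so they are excluded from the check.
        """
        return [port for port in ports if port.get("project_id")]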


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2075529

Title:
  Unable to delete "access_as_shared" RBAC policy

Status in neutron:
  Fix Released

Bug description:
  I encounter a very strange behavior when I try to add and delete the "access_as_shared" RBAC policy.
  I can add it successfully, but the subsequent delete doesn't work:

  openstack network rbac create ...   # SUCCESS
  openstack network rbac delete $ID   # FAIL

  Pre-requirements:
  - The network is external.
  - There is a floating IP or router in the network.

  Here is a demo:

  Creating an external network and a Floating IP address:

  [root@devoct30 ~]# openstack network create net0 --external -c id -f value
  9e3285c5-6034-4851-bd72-02d24f5e3f98
  [root@devoct30 ~]# openstack subnet create sub --network net0 --subnet-range 192.168.100.0/24 --no-dhcp
  [root@devoct30 ~]# openstack floating ip create net0
  [root@devoct30 ~]# openstack network rbac list --long
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | ID                                   | Object Type | Object ID                            | Action             |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | 324163f7-b79f-493e-a78d-58da0990830e | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_external |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  [root@devoct30 ~]#

  
  Adding the "access_as_shared" RBAC policy and trying to delete it:

  [root@devoct30 ~]# openstack network rbac create 9e3285c5-6034-4851-bd72-02d24f5e3f98 --type network --action access_as_shared --target-all-projects
  +--------------------+--------------------------------------+
  | Field              | Value                                |
  +--------------------+--------------------------------------+
  | action             | access_as_shared                     |
  | id                 | 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6 |
  | object_id          | 9e3285c5-6034-4851-bd72-02d24f5e3f98 |
  | object_type        | network                              |
  | project_id         | af61bf69ee0a4a7db97d2dd640d967c2     |
  | target_project_id  | *                                    |
  +--------------------+--------------------------------------+
  [root@devoct30 ~]# openstack network rbac list --long
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | ID                                   | Object Type | Object ID                            | Action             |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | 324163f7-b79f-493e-a78d-58da0990830e | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_external |
  | 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6 | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_shared   |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  [root@devoct30 ~]#
  [root@devoct30 ~]# openstack network rbac delete 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6
  Failed to delete RBAC policy with ID '4eff94d8-f872-41b3-b3ce-71cdcb40d2e6': ConflictException: 409: Client Error for url: http://10.136.19.166:9696/networking/v2.0/rbac-policies/4eff94d8-f872-41b3-b3ce-71cdcb40d2e6, RBAC policy on object 9e3285c5-6034-4851-bd72-02d24f5e3f98 cannot be removed because other objects depend on it.
  Details: Callback neutron.plugins.ml2.plugin.NeutronDbPluginV2.validate_network_rbac_policy_change-3919969 failed with "Unable to reconfigure sharing settings for networ

[Yahoo-eng-team] [Bug 2088453] Re: [UT] Neutron tests failing with eventlet 0.37.0

2024-11-22 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935524
Committed: https://opendev.org/openstack/neutron/commit/078a48d803debad81b697d9d006d9ff26d133a33
Submitter: "Zuul (22348)"
Branch: master

commit 078a48d803debad81b697d9d006d9ff26d133a33
Author: Rodolfo Alonso Hernandez 
Date:   Mon Nov 18 14:36:15 2024 +

Replace ``ReaderWriterLock`` with ``threading.RLock``

In case of having a monkey patched executable where the ``threading``
system library is replaced, the class ``RLock`` will be replaced
too.

Closes-Bug: #2088453
Change-Id: Ib0ad82c864a1167d1ea80eb1e065c4015bee3927
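
A minimal sketch of the idea behind the commit (illustrative only): once the
process is monkey patched, ``threading.RLock`` is replaced along with the
rest of the ``threading`` module, so the lock matches the execution model in
use:

    import eventlet
    eventlet.monkey_patch()  # replaces threading (among others) with green versions

    import threading

    # Because threading is monkey patched above, this RLock is the green
    # variant and is safe to use from greenthreads; without monkey patching
    # it is the ordinary OS-thread RLock.
    lock = threading.RLock()

    def critical_section():
        with lock:
            pass  # protected work goes here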


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2088453

Title:
  [UT] Neutron tests failing with eventlet 0.37.0

Status in neutron:
  Fix Released

Bug description:
  Several Neutron UTs are failing with the latest eventlet library
  version 0.37.0.

  Log:
  https://adc88d59e4dd0081446b-fa9050fbccbc1b8e2fcd252255e35175.ssl.cf2.rackcdn.com/933257/2/check/cross-neutron-py311/e399c76/testr_results.html

  Snippet: https://paste.opendev.org/show/blSAUTnOndkZTlO6lbI1/

  Requirements patch:
  https://review.opendev.org/c/openstack/requirements/+/933257

  Eventlet patch:
  https://github.com/eventlet/eventlet/commit/06ec82896ebb9a26edaf6e1ad4d63393990f15b7

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2088453/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2089388] [NEW] nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC

2024-11-22 Thread Pavel Mracek
Public bug reported:

Description
===
If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:

keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2578)')))

NOTE: Full trace in attachment.

This is most likely because, after a nova-scheduler restart, several worker processes are forked from the parent process (in nova/cmd/scheduler.py, via oslo_service/service.py). SchedulerManager, in its __init__() method, initializes the placement client by calling report.report_client_singleton() from scheduler/client/report.py, which always returns the same* instance of SchedulerReportClient. That class in turn creates the actual client in its own __init__() method (self._client = self._create_client(), around line 230), which already opens a real socket to the placement service.

But this socket is inherited by the child processes, so the whole situation after the restart (workers=2) is as follows:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
PID:1693940
nova-sche 1693940 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

NOTE: Notice that every process has the same socket open.

If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this:
PID:1693940
nova-sche 1693940 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova   11u IPv4 2361702700  0t0  TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
nova-sche 1694000 nova   21u IPv4 2361698804  0t0  TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)


And finally the second attempt in the sequence fails, and the open TCP connections to placement look like this:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
1693940
1693999
1694000
nova-sche 1694000 nova   21u IPv4 2361698804  0t0  TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)


From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.


Steps to reproduce
==
* placement service with keepalive set up (in my case 65s)
* more than 1 nova-scheduler worker (in my case workers=2)
* restart nova-scheduler
* make two scheduling calls within the keepalive time interval
* the second call gives you the SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error


Expected result
===
The parent process does not open any socket to placement, or closes it before the children are forked.
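
A minimal sketch of one way to meet that expectation (purely illustrative,
not nova's implementation; the class and method names are hypothetical):

    import os

    class PlacementClientHolder:
        """Create the HTTP client lazily so a forked child never reuses the
        parent's TLS connection."""

        def __init__(self):
            self._client = None
            # Drop any inherited client (and its sockets) right after a fork.
            os.register_at_fork(after_in_child=self._reset)

        def _reset(self):
            self._client = None

        def get(self):
            if self._client is None:
                self._client = self._create_client()  # opens sockets on first use
            return self._client

        def _create_client(self):
            ...  # build the keystoneauth/placement client here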


Environment
===
* Ubuntu 22.04.5 LTS
* nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
* I also checked the current code in the master repository and the situation is the same there
* python3-openssl 21.0.0-1
* openssl 3.0.2-0ubuntu1.18


Related patches
===
https://review.opendev.org/c/x/tobiko/+/880152
https://review.opendev.org/c/openstack/freezer/+/456758


*) Actually, I don't know if the singleton works correctly and as expected when using separate forked processes rather than threads. Is that OK?

** Affects: nova
 Importance: Undecided
 Status: New

** Attachment added: "decryption failed or bad record mac error"
   
https://bugs.launchpad.net/bugs/2089388/+attachment/5839439/+files/nova-scheduler-SSL_DECRYPTION_FAILED_OR_BAD_RECORD_MAC_full_log.txt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089388

Title:
  nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every secon

[Yahoo-eng-team] [Bug 2089157] Re: [neutron-specs] CI job "openstack-tox-docs" broken

2024-11-22 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron-specs/+/935836
Committed: https://opendev.org/openstack/neutron-specs/commit/8817b14342e0f77679f77ef78a46068561b0c125
Submitter: "Zuul (22348)"
Branch: master

commit 8817b14342e0f77679f77ef78a46068561b0c125
Author: Brian Haley 
Date:   Wed Nov 20 16:32:12 2024 -0500

Fix docs job errors and warnings

As seqdiag, blockdiag and nwdiag blocks in docs are no
longer supported with the latest pillow code, did the
following to fix the docs job:

  - Made screenshots of seqdiag/blockdiag/nwdiag images and
removed code that built them, started using the images
  - Created *diag files of above for posterity
  - Removed unused footnotes in some other files
  - Removed unnecessary files
  - Removed unused requirements
  - Bumped sphinx>=2.2.0 to match neutron repo

Closes-bug: #2089157
Change-Id: Ie9a1a18af4a21057a6cf8380c664fc4d353d2d73


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2089157

Title:
  [neutron-specs] CI job "openstack-tox-docs" broken

Status in neutron:
  Fix Released

Bug description:
  The neutron-specs CI job "openstack-tox-docs" is now broken.

  Logs:
  https://zuul.opendev.org/t/openstack/build/8d4c3717a67e49d696ee135ec398a6bb

  Snippet: https://paste.opendev.org/show/bntEaQZMSYhlxz8vhXmw/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2089157/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2089169] Re: [OVN] Test ``TestCreateNeutronPgDrop.test_non_existing`` is randomly failing

2024-11-22 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935802
Committed: https://opendev.org/openstack/neutron/commit/6d1dba09923c7333746ad5470ddf352b7916e9f9
Submitter: "Zuul (22348)"
Branch: master

commit 6d1dba09923c7333746ad5470ddf352b7916e9f9
Author: Rodolfo Alonso Hernandez 
Date:   Wed Nov 20 15:05:35 2024 +

[OVN] Add a creation wait event for the PG drop tests

Closes-Bug: #2089169
Change-Id: I3ac6364200f5124d760587612d3a9de55830f2b1


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2089169

Title:
  [OVN] Test ``TestCreateNeutronPgDrop.test_non_existing`` is randomly
  failing

Status in neutron:
  Fix Released

Bug description:
  Logs:
  https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_22b/934652/3/gate/neutron-functional-with-uwsgi/22bd251/testr_results.html

  Snippet: https://paste.opendev.org/show/baDyt2wsj0WQFAKPHMH3/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2089169/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2080365] Re: Permission denied on l3-agent and dhcp log

2024-11-22 Thread Launchpad Bug Tracker
[Expired for neutron because there has been no activity for 60 days.]

** Changed in: neutron
   Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2080365

Title:
  Permission denied on l3-agent and dhcp log

Status in neutron:
  Expired

Bug description:
  I got this error in my l3-agent log and I don't know how to fix it.

  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task [None req-dd0fc086-d150-42b1-8f65-a407c3023cd9 - - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task: PermissionError: [Errno 13] Permission denied
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task Traceback (most recent call last):
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     task(self, context)
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 890, in periodic_sync_routers_task
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     with self.namespaces_manager as ns_manager:
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/namespace_manager.py", line 71, in __enter__
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     self._all_namespaces = self.list_all()
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/namespace_manager.py", line 117, in list_all
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     namespaces = ip_lib.list_network_namespaces()
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 972, in list_network_namespaces
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     return privileged.list_netns(**kwargs)
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_privsep/priv_context.py", line 271, in _wrap
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     return self.channel.remote_call(name, args, kwargs,
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 215, in remote_call
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     raise exc_type(*result[2])
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task PermissionError: [Errno 13] Permission denied
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task

  I got this on dhcp-agent log:

  
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent [-] Unable to enable dhcp for 02f6efbb-d1dd-402e-9ea3-3e857e4e9408.: PermissionError: [Errno 13] Permission denied
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 270, in _call_driver
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     rv = getattr(driver, action)(**action_kwargs)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 324, in enable
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     common_utils.wait_until_true(self._enable, timeout=300)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 747, in wait_until_true
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     while not predicate():
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 336, in _enable
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     interface_name = self.device_manager.setup(
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1832, in setup
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     ip_lib.IPWrapper().ensure_namespace(network.namespace)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 254, in ensure_namespace
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     ip = self.netns.add(name)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 736, in add
  2024-09-11 12:00:46.840 2999 ERROR neutron.agen

[Yahoo-eng-team] [Bug 2089403] [NEW] Impossible to filter limits by project ID

2024-11-22 Thread Stephen Finucane
Public bug reported:

The 'openstack limit list' command exposes the limit list ('GET
/v3/limits') API. Both the API and the command indicate support for
'project-id' and 'domain-id' filters. However, when these are used with a
project-scoped or domain-scoped token, keystone also adds filters for the
respective project or domain ID from the token, resulting in a query like
the one below (using '--project-id' with a project-scoped token):

  SELECT `limit`.internal_id AS limit_internal_id, `limit`.id AS limit_id, `limit`.project_id AS limit_project_id, `limit`.domain_id AS limit_domain_id, `limit`.resource_limit AS limit_resource_limit, `limit`.description AS limit_description, `limit`.registered_limit_id AS limit_registered_limit_id
  FROM `limit` LEFT OUTER JOIN registered_limit ON registered_limit.id = `limit`.registered_limit_id
  WHERE `limit`.project_id = %(project_id_1)s AND `limit`.project_id = %(project_id_2)s

This means the filters must exactly match what's in the token, or keystone
will attempt to match two different values, resulting in an empty list.
This is a massive gotcha that is not documented anywhere, leading me to
think this is not the expected behaviour and that we should instead only
retrieve information from the token if the user didn't provide any filters.
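
A rough sketch of the suggested behaviour (illustrative only, not keystone's
actual code; the names are hypothetical):

    def effective_limit_filters(user_filters, token_project_id=None,
                                token_domain_id=None):
        """Fall back to the token scope only when the caller gave no filter."""
        filters = dict(user_filters)
        if "project_id" not in filters and "domain_id" not in filters:
            if token_project_id:
                filters["project_id"] = token_project_id
            elif token_domain_id:
                filters["domain_id"] = token_domain_id
        return filters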

** Affects: keystone
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/2089403

Title:
  Impossible to filter limits by project ID

Status in OpenStack Identity (keystone):
  New

Bug description:
  The 'openstack limit list' command exposes the limit list ('GET
  /v3/limits') API. Both the API and the command indicate support for
  'project-id' and 'domain-id' filters. However, when these are used with a
  project-scoped or domain-scoped token, keystone also adds filters for the
  respective project or domain ID from the token, resulting in a query like
  the one below (using '--project-id' with a project-scoped token):

    SELECT `limit`.internal_id AS limit_internal_id, `limit`.id AS limit_id, `limit`.project_id AS limit_project_id, `limit`.domain_id AS limit_domain_id, `limit`.resource_limit AS limit_resource_limit, `limit`.description AS limit_description, `limit`.registered_limit_id AS limit_registered_limit_id
    FROM `limit` LEFT OUTER JOIN registered_limit ON registered_limit.id = `limit`.registered_limit_id
    WHERE `limit`.project_id = %(project_id_1)s AND `limit`.project_id = %(project_id_2)s

  This means the filters must exactly match what's in the token, or keystone
  will attempt to match two different values, resulting in an empty list.
  This is a massive gotcha that is not documented anywhere, leading me to
  think this is not the expected behaviour and that we should instead only
  retrieve information from the token if the user didn't provide any filters.

To manage notifications about this bug go to:
https://bugs.launchpad.net/keystone/+bug/2089403/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2089386] [NEW] [RFE] Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments

2024-11-22 Thread Serhii Ivanov
Public bug reported:

Add Distributed Locking for Host Discovery Operations in Multi-Scheduler
Environments

Host discovery operations in Nova are currently vulnerable to race conditions 
and concurrent execution issues, particularly in production environments where 
multiple Nova schedulers are running simultaneously for high 
availability/redundancy, and each scheduler:
- Shares the same database backend
- Runs its own periodic automatic host discovery task
- Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same 
hosts as the schedulers

Current symptoms (due to overlapping host discovery tasks):
- Possible frequent host discovery failures, missed or incomplete host 
discoveries
- Error messages about duplicate host mappings
- Database conflicts when multiple processes try to map the same hosts 
simultaneously

Proposed Solution: Implement an opt-in distributed locking mechanism for host 
discovery operations to ensure that CLI and periodic automatic host discovery 
tasks run sequentially. The solution should:
1. Be opt-in, enabled via config option
2. Use a distributed lock (leveraging tooz.coordination) before initiating any 
host discovery operation
3. Support coordination across:
   - Scheduler automatic host discovery task
   - `nova-manage cell_v2 discover_hosts` command
4. Extend Nova configuration with an additional config option for defining 
coordinator URI

Benefits:
- Prevents race conditions during host discovery across all scenarios
- Removes the need for external complex scheduling and coordination of 
discovery jobs in high availability/redundancy setups
- Reduces operational overhead by eliminating manual conflict resolution

The solution should be configurable and work across different Nova
deployments without requiring additional external dependencies beyond
what Nova already uses for coordination. This will greatly benefit
highly available, large-scale deployments with multiple schedulers and
automated host discovery operations.
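
A rough sketch of what the proposed opt-in lock could look like (illustrative
only; the coordination URL option and lock name are hypothetical, not
existing nova configuration):

    from tooz import coordination

    def discover_hosts_with_lock(coordination_url, member_id, discover_fn):
        """Serialize host discovery across schedulers and nova-manage runs."""
        # member_id is a unique bytes identifier for this scheduler/CLI process.
        coordinator = coordination.get_coordinator(coordination_url, member_id)
        coordinator.start(start_heart=True)
        try:
            with coordinator.get_lock(b"nova-discover-hosts"):
                # Only one discovery run proceeds at a time cluster-wide.
                discover_fn()
        finally:
            coordinator.stop()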

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: rfe

** Description changed:

  Add Distributed Locking for Host Discovery Operations in Multi-Scheduler
  Environments
  
  Host discovery operations in Nova are currently vulnerable to race conditions 
and concurrent execution issues, particularly in production environments where 
multiple Nova schedulers are running simultaneously for high 
availability/redundancy, and each scheduler:
  - Shares the same database backend
  - Runs its own periodic automatic host discovery task
  - Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same 
hosts as the schedulers
  
  Current symptoms (due to overlapping host discovery tasks):
  - Possible frequent host discovery failures, missed or incomplete host 
discoveries
  - Error messages about duplicate host mappings
  - Database conflicts when multiple processes try to map the same hosts 
simultaneously
  
  Proposed Solution: Implement an opt-in distributed locking mechanism for host 
discovery operations to ensure that CLI and periodic automatic host discovery 
tasks run sequentially. The solution should:
- 1. Use a distributed lock (leveraging tooz.coordination) before initiating 
any host discovery operation
- 2. Support coordination across:
-- Scheduler automatic host discovery task
-- `nova-manage cell_v2 discover_hosts` command
- 3. Extend Nova configuration with an additional config option for defining 
coordinator URI
+ 1. Be opt-in, enabled via config option
+ 2. Use a distributed lock (leveraging tooz.coordination) before initiating 
any host discovery operation
+ 3. Support coordination across:
+    - Scheduler automatic host discovery task
+    - `nova-manage cell_v2 discover_hosts` command
+ 4. Extend Nova configuration with an additional config option for defining 
coordinator URI
  
  Benefits:
  - Prevents race conditions during host discovery across all scenarios
  - Removes the need for external complex scheduling and coordination of 
discovery jobs in high availability/redundancy setups
  - Reduces operational overhead by eliminating manual conflict resolution
  
  The solution should be configurable and work across different Nova
  deployments without requiring additional external dependencies beyond
  what Nova already uses for coordination. This will greatly benefit
  highly available, large-scale deployments with multiple schedulers and
  automated host discovery operations.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089386

Title:
  [RFE] Add Distributed Locking for Host Discovery Operations in Multi-
  Scheduler Environments

Status in OpenStack Compute (nova):
  New

Bug description:
  Add Distributed Locking for Host Discovery Operations in Multi-
  Scheduler Environments

  Host discovery operations in Nova are currently vulnerable to 

[Yahoo-eng-team] [Bug 2089386] Re: [RFE] Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments

2024-11-22 Thread sean mooney
I'm not going to mark this as invalid, but introducing a distributed lock
manager to nova would, I think, require a spec.

It's a very heavyweight solution to enabling a topology we do not
officially support today.

Today we require that, if the periodic task is enabled, it is only enabled
in one scheduler instance, precisely to mitigate the problem described here.


That does not mean we cannot improve the current situation or that we can't
discuss this, but it would be a feature, not a bug, as this is an existing,
known limitation of the periodic task.


Alternatives are:
 - externally scheduling the host mapping (via cron or a k8s job)
 - using https://en.wikipedia.org/wiki/Rendezvous_hashing to distribute the
mapping tasks between all schedulers to get eventual consistency (sketched
below)
 - gracefully handling the db conflict, proceeding with the other mappings,
and moving the error/warning to debug level

tooz has a low number of maintainers, and nova was planning to remove it
from our dependency list with the removal of the ironic driver's use of its
hash ring.

As a general design goal, nova intends the scheduler service to be
effectively stateless and horizontally scalable; adding any kind of
distributed locking limits that scalability, and it is a non-trivial cost
to require a tooz persistence backend just for this.


One enhancement that should be made: the config option currently does not
carry the guidance that it should only be enabled on one scheduler:

https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.discover_hosts_in_cells_interval

While I believe that is discussed elsewhere in the docs, if you only look
at that option it's not obvious that this is not recommended or supported.
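
As a rough illustration of the rendezvous-hashing alternative above
(illustrative only, not nova code):

    import hashlib

    def responsible_scheduler(cell_uuid, scheduler_ids):
        """Pick one scheduler per cell; every scheduler computes the same
        answer with no coordination, so only the winner runs discovery for
        that cell."""
        def score(scheduler_id):
            return hashlib.sha256(f"{scheduler_id}:{cell_uuid}".encode()).hexdigest()
        return max(scheduler_ids, key=score)

    # Example: a scheduler only discovers hosts for the cells it "owns".
    # if responsible_scheduler(cell.uuid, known_schedulers) == my_id:
    #     discover_hosts(cell)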


** Changed in: nova
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089386

Title:
  [RFE] Add Distributed Locking for Host Discovery Operations in Multi-
  Scheduler Environments

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  Add Distributed Locking for Host Discovery Operations in Multi-
  Scheduler Environments

  Host discovery operations in Nova are currently vulnerable to race conditions 
and concurrent execution issues, particularly in production environments where 
multiple Nova schedulers are running simultaneously for high 
availability/redundancy, and each scheduler:
  - Shares the same database backend
  - Runs its own periodic automatic host discovery task
  - Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same 
hosts as the schedulers

  Current symptoms (due to overlapping host discovery tasks):
  - Possible frequent host discovery failures, missed or incomplete host 
discoveries
  - Error messages about duplicate host mappings
  - Database conflicts when multiple processes try to map the same hosts 
simultaneously

  Proposed Solution: Implement an opt-in distributed locking mechanism for host 
discovery operations to ensure that CLI and periodic automatic host discovery 
tasks run sequentially. The solution should:
  1. Be opt-in, enabled via config option
  2. Use a distributed lock (leveraging tooz.coordination) before initiating 
any host discovery operation
  3. Support coordination across:
     - Scheduler automatic host discovery task
     - `nova-manage cell_v2 discover_hosts` command
  4. Extend Nova configuration with an additional config option for defining 
coordinator URI

  Benefits:
  - Prevents race conditions during host discovery across all scenarios
  - Removes the need for external complex scheduling and coordination of 
discovery jobs in high availability/redundancy setups
  - Reduces operational overhead by eliminating manual conflict resolution

  The solution should be configurable and work across different Nova
  deployments without requiring additional external dependencies beyond
  what Nova already uses for coordination. This will greatly benefit
  highly available, large-scale deployments with multiple schedulers and
  automated host discovery operations.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2089386/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp