[Yahoo-eng-team] [Bug 1446716] Re: Scheduler weighers do not output individual weights to debug log
I think this issue was resolved by this commit:

commit 154ab7b2f9ad80fe432d2c036d5e8c4ee171897b
Author: Balazs Gibizer
Date: Thu Nov 11 19:06:56 2021 +0100

    Add debug log for scheduler weight calculation

    We have all the weighers enabled by default and each can have its own
    multiplier making the final compute node order calculation pretty
    complex. This patch adds some debug logging that helps understanding
    how the final ordering was reached.

    Change-Id: I7606d6eb3e08548c1df9dc245ab39cced7de1fb5

** Changed in: nova
       Status: Triaged => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1446716

Title:
  Scheduler weighers do not output individual weights to debug log

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In the nova scheduler (and nova-cells) the filters report debug level
  logging on what decisions each filter makes on the hosts (cells). The
  weighers, however, only give the combined weight after all have been
  applied. As an operator, I want to be able to see the weight each
  weigher contributes in the debug log so that I can troubleshoot
  scheduler problems better.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1446716/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
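The commit above logs each weigher's contribution. The arithmetic it exposes can be sketched as follows (an illustrative model only, not Nova's actual weigher code; the weigher names, multipliers, and host attributes are invented):

```python
# Sketch: a host's final weight is the sum of per-weigher values, each
# scaled by its multiplier. The fix makes each term of this sum visible
# in the debug log instead of only the total.

def combined_weight(host, weighers):
    """weighers: list of (name, multiplier, weigh_fn) tuples."""
    total = 0.0
    for name, multiplier, weigh_fn in weighers:
        contribution = multiplier * weigh_fn(host)
        # This per-weigher value is what operators wanted to see logged.
        print(f"{name}: multiplier={multiplier} contribution={contribution}")
        total += contribution
    return total

hosts = [{"name": "node1", "free_ram": 0.8, "free_disk": 0.2},
         {"name": "node2", "free_ram": 0.5, "free_disk": 0.9}]
weighers = [
    ("RAMWeigher", 1.0, lambda h: h["free_ram"]),
    ("DiskWeigher", 1.0, lambda h: h["free_disk"]),
]

# Hosts are ordered by descending combined weight.
ranked = sorted(hosts, key=lambda h: combined_weight(h, weighers),
                reverse=True)
```

With only the combined total in the log, one cannot tell whether node2 won on RAM or on disk; logging each contribution answers that directly.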
[Yahoo-eng-team] [Bug 2030825] Re: s3 backend fails with invalid certificate when using s3 compatible storage
** Project changed: glance => glance-store

** Changed in: glance-store
     Assignee: (unassigned) => Cyril Roelandt (cyril-roelandt)

** Changed in: glance-store
       Status: New => Fix Committed

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2030825

Title:
  s3 backend fails with invalid certificate when using s3 compatible storage

Status in glance_store:
  Fix Committed

Bug description:
  When using the Glance s3 backend with an s3-compatible store, image
  operations fail with: [SSL: CERTIFICATE_VERIFY_FAILED] certificate
  verify failed: self-signed certificate in certificate chain
  (_ssl.c:1129).

  The current implementation uses boto3 and assumes you are only using
  Amazon's implementation, as there are not currently any settings for
  overriding the CA. In my case, we are using an s3-compatible on-prem
  device which has internal corporate certs. If I set the AWS_CA_BUNDLE
  environment variable to my CA bundle, the s3 backend then works great.

  Can we add an option to the s3_backend configuration so that the
  location of a CA bundle can be specified and the default CA
  overridden? A few of the other store options have this functionality
  already, so we would need to add the support for boto3. This was
  tested in Antelope and validated to work once the environment variable
  was added.

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance-store/+bug/2030825/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
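The reporter's workaround, and the shape of a possible config-driven fix, sketched in Python (the bundle path and the `s3_store_cacert` option name are hypothetical; only the AWS_CA_BUNDLE variable and boto3's `verify` argument are documented behaviour):

```python
import os

# Workaround from the report: botocore honors AWS_CA_BUNDLE, so pointing
# it at the corporate CA bundle makes the s3 backend trust the on-prem
# endpoint's certificate chain. The path below is a placeholder.
os.environ["AWS_CA_BUNDLE"] = "/etc/ssl/certs/corp-ca-bundle.pem"

# A config-driven fix could instead pass the bundle explicitly, since
# boto3 clients accept a ``verify`` argument (the option name below is
# a hypothetical glance-store setting, shown as a comment only):
#
#   s3 = boto3.client("s3", endpoint_url=conf.s3_store_host,
#                     verify=conf.s3_store_cacert)
```

The environment-variable route requires changing the service's process environment, which is why a first-class config option is the cleaner long-term fix.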
[Yahoo-eng-team] [Bug 2075529] Re: Unable to delete "access_as_shared" RBAC policy
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935278
Committed: https://opendev.org/openstack/neutron/commit/90d836bc420ccd309196ece7908b41b9e2c4f766
Submitter: "Zuul (22348)"
Branch:    master

commit 90d836bc420ccd309196ece7908b41b9e2c4f766
Author: Rodolfo Alonso Hernandez
Date: Fri Nov 15 11:08:19 2024 +

    Filter out the floating IPs when removing a shared RBAC

    When a RBAC with action=access_as_shared is removed from a network, it
    is checked first that there are no elements (ports) in this network
    that could no longer exist due to the RBAC permissions reduction.

    The floating IP related ports, that have project_id='' by definition,
    should be removed from this check. These ports can be created due to a
    RBAC with action=access_as_external. If a floating IP port is present
    in the network, it should not block the RBAC with
    action=access_as_shared removal.

    Closes-Bug: #2075529
    Change-Id: I7e31c21c04dc1ef26f5f05537ca0d2cb8f5ca505

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2075529

Title:
  Unable to delete "access_as_shared" RBAC policy

Status in neutron:
  Fix Released

Bug description:
  I encounter a very strange behavior when I try to add and delete the
  "access_as_shared" RBAC policy. I can add it successfully, but the
  subsequent delete doesn't work:

  openstack network rbac create ...  # SUCCESS
  openstack network rbac delete $ID  # FAIL

  Pre-requirements:
  - The network is external.
  - There is a floating IP or router in the network.
  Here is a demo. Creating an external network and a Floating IP address:

  [root@devoct30 ~]# openstack network create net0 --external -c id -f value
  9e3285c5-6034-4851-bd72-02d24f5e3f98
  [root@devoct30 ~]# openstack subnet create sub --network net0 --subnet-range 192.168.100.0/24 --no-dhcp
  [root@devoct30 ~]# openstack floating ip create net0
  [root@devoct30 ~]# openstack network rbac list --long
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | ID                                   | Object Type | Object ID                            | Action             |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | 324163f7-b79f-493e-a78d-58da0990830e | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_external |
  +--------------------------------------+-------------+--------------------------------------+--------------------+

  Adding the "access_as_shared" RBAC policy and trying to delete it:

  [root@devoct30 ~]# openstack network rbac create 9e3285c5-6034-4851-bd72-02d24f5e3f98 --type network --action access_as_shared --target-all-projects
  +-------------------+--------------------------------------+
  | Field             | Value                                |
  +-------------------+--------------------------------------+
  | action            | access_as_shared                     |
  | id                | 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6 |
  | object_id         | 9e3285c5-6034-4851-bd72-02d24f5e3f98 |
  | object_type       | network                              |
  | project_id        | af61bf69ee0a4a7db97d2dd640d967c2     |
  | target_project_id | *                                    |
  +-------------------+--------------------------------------+
  [root@devoct30 ~]# openstack network rbac list --long
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | ID                                   | Object Type | Object ID                            | Action             |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  | 324163f7-b79f-493e-a78d-58da0990830e | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_external |
  | 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6 | network     | 9e3285c5-6034-4851-bd72-02d24f5e3f98 | access_as_shared   |
  +--------------------------------------+-------------+--------------------------------------+--------------------+
  [root@devoct30 ~]# openstack network rbac delete 4eff94d8-f872-41b3-b3ce-71cdcb40d2e6
  Failed to delete RBAC policy with ID '4eff94d8-f872-41b3-b3ce-71cdcb40d2e6': ConflictException: 409: Client Error for url: http://10.136.19.166:9696/networking/v2.0/rbac-policies/4eff94d8-f872-41b3-b3ce-71cdcb40d2e6, RBAC policy on object 9e3285c5-6034-4851-bd72-02d24f5e3f98 cannot be removed because other objects depend on it.
Details: Callback neutron.plugins.ml2.plugin.NeutronDbPluginV2.validate_network_rbac_policy_change-3919969 failed with "Unable to reconfigure sharing settings for networ
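The idea behind the committed fix can be sketched in a few lines (an illustrative model only, not the actual Neutron query; the port data and helper function are invented):

```python
# Sketch: when validating that an access_as_shared RBAC can be removed,
# ports owned by floating IPs -- which have project_id == '' by
# definition -- should not count as external users of the network.

def ports_blocking_rbac_removal(ports, rbac_owner_project_id):
    return [
        p for p in ports
        if p["project_id"]                           # skip FIP ports ('')
        and p["project_id"] != rbac_owner_project_id  # skip the owner's own
    ]

ports = [
    {"id": "fip-port", "project_id": ""},         # floating IP port
    {"id": "vm-port", "project_id": "af61bf69"},  # owner's own port
]

# With the FIP port filtered out, nothing blocks removal of the
# access_as_shared RBAC in the scenario from this bug.
blockers = ports_blocking_rbac_removal(ports, "af61bf69")
```

Before the fix, the floating IP port (created via the access_as_external RBAC) landed in the blocker list and the 409 above was raised.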
[Yahoo-eng-team] [Bug 2088453] Re: [UT] Neutron tests failing with eventlet 0.37.0
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935524
Committed: https://opendev.org/openstack/neutron/commit/078a48d803debad81b697d9d006d9ff26d133a33
Submitter: "Zuul (22348)"
Branch:    master

commit 078a48d803debad81b697d9d006d9ff26d133a33
Author: Rodolfo Alonso Hernandez
Date: Mon Nov 18 14:36:15 2024 +

    Replace ``ReaderWriterLock`` with ``threading.RLock``

    In case of having a monkey patched executable where the ``threading``
    system library is replaced, the class ``RLock`` will be replaced too.

    Closes-Bug: #2088453
    Change-Id: Ib0ad82c864a1167d1ea80eb1e065c4015bee3927

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2088453

Title:
  [UT] Neutron tests failing with eventlet 0.37.0

Status in neutron:
  Fix Released

Bug description:
  Several Neutron UTs are failing with the latest eventlet library
  version 0.37.0.

  Log: https://adc88d59e4dd0081446b-fa9050fbccbc1b8e2fcd252255e35175.ssl.cf2.rackcdn.com/933257/2/check/cross-neutron-py311/e399c76/testr_results.html
  Snippet: https://paste.opendev.org/show/blSAUTnOndkZTlO6lbI1/
  Requirements patch: https://review.opendev.org/c/openstack/requirements/+/933257
  Eventlet patch: https://github.com/eventlet/eventlet/commit/06ec82896ebb9a26edaf6e1ad4d63393990f15b7

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2088453/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
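A minimal illustration of why `threading.RLock` fits here (a sketch, not the Neutron code): it is reentrant for the owning thread, and because it lives in the `threading` module it gets swapped for a green-thread-aware version along with everything else when the executable is monkey patched.

```python
import threading

lock = threading.RLock()

def nested():
    with lock:        # first acquisition by this thread
        with lock:    # reentrant re-acquisition by the same thread succeeds
            return "ok"

# A plain threading.Lock would deadlock on the inner ``with`` here;
# RLock tracks the owner and a recursion count instead.
result = nested()
```

Under eventlet monkey patching, `threading.RLock` is replaced wholesale, so the lock keeps its semantics in green threads, whereas a third-party lock class may hold references to the unpatched primitives.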
[Yahoo-eng-team] [Bug 2089388] [NEW] nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC
Public bug reported:

Description
===========
If I try to schedule a VM immediately after restarting the nova-scheduler service, I get this error on every second scheduling request:

keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2578)')))

NOTE: Full trace in attachment.

This is most likely because, after a nova-scheduler restart, several worker processes based on the parent of SchedulerManager() are created in nova/cmd/scheduler.py (via ../oslo_service/service.py), and SchedulerManager(), in its __init__() method, initializes the placement client by calling report.report_client_singleton() from scheduler/client/report.py, which should always return the same* instance of class SchedulerReportClient(). That class creates the actual client, also within its __init__() method:

self._client = self._create_client()  # line 230 approx.

This already opens a real socket to the placement service. But this socket is inherited by the children, so the whole situation after the restart (workers=2) is as follows:

root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

NOTE: Notice that every process has the same socket open. If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it.
The situation then looks like this:

PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:3->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

And finally the second attempt in the sequence fails, and the open TCP connections to placement look like this:

root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
1693940
1693999
1694000
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.

Steps to reproduce
==================
* placement service with keepalive set up (in my case 65s)
* more than 1 nova-scheduler worker (in my case workers==2)
* restart nova-scheduler
* make two scheduling calls within the keepalive time interval
* the second call gives you the SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error

Expected result
===============
The parent process does not open any socket to placement, or closes it before the children fork.

Environment
===========
* Ubuntu 22.04.5 LTS
* nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
* I also checked the current code in the master repository and the situation is the same there
* python3-openssl 21.0.0-1
* openssl 3.0.2-0ubuntu1.18

Related patches
===============
https://review.opendev.org/c/x/tobiko/+/880152
https://review.opendev.org/c/openstack/freezer/+/456758

*) Actually, I don't know if the singleton works correctly and as expected when not using threads but separate forked processes. Is that OK?
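The shared-socket mechanism described in this report can be reproduced with a plain `fork()` (a minimal sketch, not Nova code): a socket opened before forking refers to the same kernel connection in every process, which is why multiple workers can end up interleaving records on one TLS session and corrupt its record MAC.

```python
import os
import socket

# A connected socket created BEFORE fork() ...
parent_sock, server_sock = socket.socketpair()

pid = os.fork()
if pid == 0:
    # ... is inherited by the child: same file description, same
    # kernel connection, just like fd 11u in the lsof output above.
    parent_sock.sendall(b"child")
    os._exit(0)

os.waitpid(pid, 0)               # child has written and exited by now
parent_sock.sendall(b"parent")   # parent writes into the SAME stream

data = b""
while len(data) < len(b"childparent"):
    data += server_sock.recv(64)  # both processes' bytes arrive on one socket
```

With plain bytes the writes merely interleave; with TLS, each process keeps its own record sequence numbers, so interleaved records fail MAC verification, matching the error seen. The fix direction stated under "Expected result" follows: don't open (or do close) the placement connection in the parent before forking, so each worker builds its own.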
** Affects: nova Importance: Undecided Status: New ** Attachment added: "decryption failed or bad record mac error" https://bugs.launchpad.net/bugs/2089388/+attachment/5839439/+files/nova-scheduler-SSL_DECRYPTION_FAILED_OR_BAD_RECORD_MAC_full_log.txt -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2089388 Title: nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC Status in OpenStack Compute (nova): New Bug description: Description === If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every secon
[Yahoo-eng-team] [Bug 2089157] Re: [neutron-specs] CI job "openstack-tox-docs" broken
Reviewed:  https://review.opendev.org/c/openstack/neutron-specs/+/935836
Committed: https://opendev.org/openstack/neutron-specs/commit/8817b14342e0f77679f77ef78a46068561b0c125
Submitter: "Zuul (22348)"
Branch:    master

commit 8817b14342e0f77679f77ef78a46068561b0c125
Author: Brian Haley
Date: Wed Nov 20 16:32:12 2024 -0500

    Fix docs job errors and warnings

    As seqdiag, blockdiag and nwdiag blocks in docs are no longer
    supported with the latest pillow code, did the following to fix
    the docs job:

    - Made screenshots of seqdiag/blockdiag/nwdiag images and removed
      code that built them, started using the images
    - Created *diag files of above for posterity
    - Removed unused footnotes in some other files
    - Removed unnecessary files
    - Removed unused requirements
    - Bumped sphinx>=2.2.0 to match neutron repo

    Closes-bug: #2089157
    Change-Id: Ie9a1a18af4a21057a6cf8380c664fc4d353d2d73

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2089157

Title:
  [neutron-specs] CI job "openstack-tox-docs" broken

Status in neutron:
  Fix Released

Bug description:
  The neutron-specs CI job "openstack-tox-docs" is now broken.

  Logs: https://zuul.opendev.org/t/openstack/build/8d4c3717a67e49d696ee135ec398a6bb
  Snippet: https://paste.opendev.org/show/bntEaQZMSYhlxz8vhXmw/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2089157/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2089169] Re: [OVN] Test ``TestCreateNeutronPgDrop.test_non_existing`` is randomly failing
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/935802
Committed: https://opendev.org/openstack/neutron/commit/6d1dba09923c7333746ad5470ddf352b7916e9f9
Submitter: "Zuul (22348)"
Branch:    master

commit 6d1dba09923c7333746ad5470ddf352b7916e9f9
Author: Rodolfo Alonso Hernandez
Date: Wed Nov 20 15:05:35 2024 +

    [OVN] Add a creation wait event for the PG drop tests

    Closes-Bug: #2089169
    Change-Id: I3ac6364200f5124d760587612d3a9de55830f2b1

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2089169

Title:
  [OVN] Test ``TestCreateNeutronPgDrop.test_non_existing`` is randomly failing

Status in neutron:
  Fix Released

Bug description:
  Logs: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_22b/934652/3/gate/neutron-functional-with-uwsgi/22bd251/testr_results.html
  Snippet: https://paste.opendev.org/show/baDyt2wsj0WQFAKPHMH3/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2089169/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2080365] Re: Permission denied on l3-agent and dhcp log
[Expired for neutron because there has been no activity for 60 days.]

** Changed in: neutron
       Status: Incomplete => Expired

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2080365

Title:
  Permission denied on l3-agent and dhcp log

Status in neutron:
  Expired

Bug description:
  I got this error on my l3-agent log and I don't know how to fix it:

  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task [None req-dd0fc086-d150-42b1-8f65-a407c3023cd9 - - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task: PermissionError: [Errno 13] Permission denied
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task Traceback (most recent call last):
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     task(self, context)
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 890, in periodic_sync_routers_task
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     with self.namespaces_manager as ns_manager:
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/namespace_manager.py", line 71, in __enter__
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     self._all_namespaces = self.list_all()
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/l3/namespace_manager.py", line 117, in list_all
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     namespaces = ip_lib.list_network_namespaces()
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 972, in list_network_namespaces
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     return privileged.list_netns(**kwargs)
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_privsep/priv_context.py", line 271, in _wrap
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     return self.channel.remote_call(name, args, kwargs,
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task   File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 215, in remote_call
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task     raise exc_type(*result[2])
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task PermissionError: [Errno 13] Permission denied
  2024-09-11 11:35:00.672 3022 ERROR oslo_service.periodic_task

  I got this on the dhcp-agent log:

  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent [-] Unable to enable dhcp for 02f6efbb-d1dd-402e-9ea3-3e857e4e9408.: PermissionError: [Errno 13] Permission denied
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 270, in _call_driver
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     rv = getattr(driver, action)(**action_kwargs)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 324, in enable
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     common_utils.wait_until_true(self._enable, timeout=300)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 747, in wait_until_true
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     while not predicate():
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 336, in _enable
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     interface_name = self.device_manager.setup(
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1832, in setup
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     ip_lib.IPWrapper().ensure_namespace(network.namespace)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 254, in ensure_namespace
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent     ip = self.netns.add(name)
  2024-09-11 12:00:46.840 2999 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 736, in add
  2024-09-11 12:00:46.840 2999 ERROR neutron.agen
[Yahoo-eng-team] [Bug 2089403] [NEW] Impossible to filter limits by project ID
Public bug reported:

The 'openstack limit list' command exposes the limit list ('GET /v3/limits') API. Both the API and the command indicate support for 'project-id' and 'domain-id' filters. However, if these are used with a project-scoped or domain-scoped token, keystone also adds filters for the respective project or domain ID from the token, resulting in a query like the one below (using '--project-id' with a project-scoped token):

SELECT `limit`.internal_id AS limit_internal_id, `limit`.id AS limit_id, `limit`.project_id AS limit_project_id, `limit`.domain_id AS limit_domain_id, `limit`.resource_limit AS limit_resource_limit, `limit`.description AS limit_description, `limit`.registered_limit_id AS limit_registered_limit_id
FROM `limit`
LEFT OUTER JOIN registered_limit ON registered_limit.id = `limit`.registered_limit_id
WHERE `limit`.project_id = %(project_id_1)s AND `limit`.project_id = %(project_id_2)s

This means the filters must exactly match what's in the token, or keystone will attempt to match two different values, resulting in an empty list. This is a massive gotcha that is not documented anywhere, leading me to think this is not the expected behaviour; we should instead only retrieve information from the token if the user didn't provide any filters.

** Affects: keystone
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/2089403

Title:
  Impossible to filter limits by project ID

Status in OpenStack Identity (keystone):
  New

Bug description:
  The 'openstack limit list' command exposes the limit list ('GET /v3/limits') API. Both the API and the command indicate support for 'project-id' and 'domain-id' filters.
  However, if these are used with a project-scoped or domain-scoped token, keystone also adds filters for the respective project or domain ID from the token, resulting in a query like the one below (using '--project-id' with a project-scoped token):

  SELECT `limit`.internal_id AS limit_internal_id, `limit`.id AS limit_id, `limit`.project_id AS limit_project_id, `limit`.domain_id AS limit_domain_id, `limit`.resource_limit AS limit_resource_limit, `limit`.description AS limit_description, `limit`.registered_limit_id AS limit_registered_limit_id
  FROM `limit`
  LEFT OUTER JOIN registered_limit ON registered_limit.id = `limit`.registered_limit_id
  WHERE `limit`.project_id = %(project_id_1)s AND `limit`.project_id = %(project_id_2)s

  This means the filters must exactly match what's in the token, or keystone will attempt to match two different values, resulting in an empty list. This is a massive gotcha that is not documented anywhere, leading me to think this is not the expected behaviour; we should instead only retrieve information from the token if the user didn't provide any filters.

To manage notifications about this bug go to:
https://bugs.launchpad.net/keystone/+bug/2089403/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
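The conflicting-filters behaviour can be demonstrated with a toy table (sqlite and an invented schema, not keystone's actual model): ANDing the caller-supplied project_id with the token's project_id can only match when the two values are identical, so any other filter value yields an empty list.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lim (id TEXT, project_id TEXT)")
db.execute("INSERT INTO lim VALUES ('l1', 'proj-a')")
db.execute("INSERT INTO lim VALUES ('l2', 'proj-b')")

token_project = "proj-a"      # taken from the project-scoped token
requested_project = "proj-b"  # ?project_id=... supplied by the caller

# Both predicates are ANDed on the same column, as in the query above.
rows = db.execute(
    "SELECT id FROM lim WHERE project_id = ? AND project_id = ?",
    (requested_project, token_project),
).fetchall()

# rows is empty even though limits exist for each project individually.
```

Dropping the token-derived predicate whenever the user supplied an explicit filter (as the report suggests) would make the same query return the 'proj-b' row.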
[Yahoo-eng-team] [Bug 2089386] [NEW] [RFE] Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments
Public bug reported:

Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments

Host discovery operations in Nova are currently vulnerable to race conditions and concurrent execution issues, particularly in production environments where multiple Nova schedulers are running simultaneously for high availability/redundancy, and each scheduler:
- Shares the same database backend
- Runs its own periodic automatic host discovery task
- Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same hosts as the schedulers

Current symptoms (due to overlapping host discovery tasks):
- Possible frequent host discovery failures, missed or incomplete host discoveries
- Error messages about duplicate host mappings
- Database conflicts when multiple processes try to map the same hosts simultaneously

Proposed Solution:
Implement an opt-in distributed locking mechanism for host discovery operations to ensure that CLI and periodic automatic host discovery tasks run sequentially. The solution should:
1. Be opt-in, enabled via config option
2. Use a distributed lock (leveraging tooz.coordination) before initiating any host discovery operation
3. Support coordination across:
   - Scheduler automatic host discovery task
   - `nova-manage cell_v2 discover_hosts` command
4. Extend Nova configuration with an additional config option for defining the coordinator URI

Benefits:
- Prevents race conditions during host discovery across all scenarios
- Removes the need for external complex scheduling and coordination of discovery jobs in high availability/redundancy setups
- Reduces operational overhead by eliminating manual conflict resolution

The solution should be configurable and work across different Nova deployments without requiring additional external dependencies beyond what Nova already uses for coordination. This will greatly benefit highly available, large-scale deployments with multiple schedulers and automated host discovery operations.
** Affects: nova Importance: Undecided Status: New ** Tags: rfe ** Description changed: Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments Host discovery operations in Nova are currently vulnerable to race conditions and concurrent execution issues, particularly in production environments where multiple Nova schedulers are running simultaneously for high availability/redundancy, and each scheduler: - Shares the same database backend - Runs its own periodic automatic host discovery task - Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same hosts as the schedulers Current symptoms (due to overlapping host discovery tasks): - Possible frequent host discovery failures, missed or incomplete host discoveries - Error messages about duplicate host mappings - Database conflicts when multiple processes try to map the same hosts simultaneously Proposed Solution: Implement an opt-in distributed locking mechanism for host discovery operations to ensure that CLI and periodic automatic host discovery tasks run sequentially. The solution should: - 1. Use a distributed lock (leveraging tooz.coordination) before initiating any host discovery operation - 2. Support coordination across: -- Scheduler automatic host discovery task -- `nova-manage cell_v2 discover_hosts` command - 3. Extend Nova configuration with an additional config option for defining coordinator URI + 1. Be opt-in, enabled via config option + 2. Use a distributed lock (leveraging tooz.coordination) before initiating any host discovery operation + 3. Support coordination across: + - Scheduler automatic host discovery task + - `nova-manage cell_v2 discover_hosts` command + 4. 
Extend Nova configuration with an additional config option for defining coordinator URI Benefits: - Prevents race conditions during host discovery across all scenarios - Removes the need for external complex scheduling and coordination of discovery jobs in high availability/redundancy setups - Reduces operational overhead by eliminating manual conflict resolution The solution should be configurable and work across different Nova deployments without requiring additional external dependencies beyond what Nova already uses for coordination. This will greatly benefit highly available, large-scale deployments with multiple schedulers and automated host discovery operations. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2089386 Title: [RFE] Add Distributed Locking for Host Discovery Operations in Multi- Scheduler Environments Status in OpenStack Compute (nova): New Bug description: Add Distributed Locking for Host Discovery Operations in Multi- Scheduler Environments Host discovery operations in Nova are currently vulnerable to
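The proposed serialization could look roughly like the sketch below. As a local stand-in for a tooz distributed lock it uses a file lock (the lock path and function names are hypothetical; across hosts, tooz's coordinator and its `get_lock` would play this role instead):

```python
import fcntl

# Placeholder lock path; a tooz backend URI would replace this in the
# proposal (e.g. the coordinator URI from the suggested config option).
LOCK_PATH = "/tmp/nova-discover-hosts.lock"

def discover_hosts_serialized(discover_fn):
    """Run one discovery pass while holding an exclusive lock, so the
    periodic task and the nova-manage CLI cannot overlap."""
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until exclusive
        try:
            return discover_fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Stand-in for the real discovery routine.
result = discover_hosts_serialized(lambda: "mapped 3 hosts")
```

A file lock only serializes processes on one host; the RFE's point is that a tooz coordinator (redis, etcd, zookeeper, ...) extends the same pattern across all scheduler hosts.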
[Yahoo-eng-team] [Bug 2089386] Re: [RFE] Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments
I'm not going to mark this as invalid, but introducing a distributed lock manager to nova would, I think, require a spec. It's a very heavyweight solution to enabling a topology we do not officially support today. Today we require that, if the periodic task is enabled, it's only enabled in one scheduler instance, precisely to mitigate the problem described here. That does not mean we cannot improve the current situation or that we can't discuss this, but it would be a feature, not a bug, as this is an existing, known limitation of the periodic task.

Alternatives are:
- externally scheduling the host mapping (via cron or a k8s job)
- using https://en.wikipedia.org/wiki/Rendezvous_hashing to distribute the mapping tasks between all schedulers to get eventual consistency
- gracefully handling the db conflict, proceeding with the other mappings, and moving the error/warning to debug level

tooz has a low number of maintainers, and nova was planning to remove it from our dependency list with the removal of the ironic driver's use of its hash ring. As a general design goal, nova intends the scheduler service to be effectively stateless and horizontally scalable; adding any kind of distributed locking limits that scalability, and it is a non-trivial cost to require a tooz persistence backend just for this.

One enhancement that should be made: the config option currently does not carry the guidance that it should only be enabled on one scheduler: https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.discover_hosts_in_cells_interval While I believe that is discussed elsewhere in the docs, if you only look at that option it's not obvious that this is not recommended or supported.

** Changed in: nova
       Status: New => Opinion

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089386 Title: [RFE] Add Distributed Locking for Host Discovery Operations in Multi- Scheduler Environments Status in OpenStack Compute (nova): Opinion Bug description: Add Distributed Locking for Host Discovery Operations in Multi- Scheduler Environments Host discovery operations in Nova are currently vulnerable to race conditions and concurrent execution issues, particularly in production environments where multiple Nova schedulers are running simultaneously for high availability/redundancy, and each scheduler: - Shares the same database backend - Runs its own periodic automatic host discovery task - Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same hosts as the schedulers Current symptoms (due to overlapping host discovery tasks): - Possible frequent host discovery failures, missed or incomplete host discoveries - Error messages about duplicate host mappings - Database conflicts when multiple processes try to map the same hosts simultaneously Proposed Solution: Implement an opt-in distributed locking mechanism for host discovery operations to ensure that CLI and periodic automatic host discovery tasks run sequentially. The solution should: 1. Be opt-in, enabled via config option 2. Use a distributed lock (leveraging tooz.coordination) before initiating any host discovery operation 3. Support coordination across: - Scheduler automatic host discovery task - `nova-manage cell_v2 discover_hosts` command 4. 
Extend Nova configuration with an additional config option for defining coordinator URI Benefits: - Prevents race conditions during host discovery across all scenarios - Removes the need for external complex scheduling and coordination of discovery jobs in high availability/redundancy setups - Reduces operational overhead by eliminating manual conflict resolution The solution should be configurable and work across different Nova deployments without requiring additional external dependencies beyond what Nova already uses for coordination. This will greatly benefit highly available, large-scale deployments with multiple schedulers and automated host discovery operations. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2089386/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
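The rendezvous-hashing alternative raised in the comment can be sketched as follows (all names are illustrative): every scheduler independently computes the same owner for a given cell, so only that owner runs discovery for it, with no lock and no shared state beyond the membership list.

```python
import hashlib

def owner(cell, schedulers):
    """Rendezvous (highest-random-weight) hashing: score every
    scheduler against the cell and pick the highest score."""
    def score(sched):
        return hashlib.sha256(f"{cell}:{sched}".encode()).hexdigest()
    return max(schedulers, key=score)

schedulers = ["sched-1", "sched-2", "sched-3"]

# Each scheduler evaluates this locally and acts only on cells it owns,
# so discovery work is partitioned without any coordination service.
assignments = {cell: owner(cell, schedulers)
               for cell in ["cell1", "cell2", "cell3"]}
```

When a scheduler joins or leaves, only the cells whose highest-scoring member changed move to a new owner, which gives the "eventual consistency" the comment refers to.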