Oooooohhhhh... *click*!

I wasn't aware that Keystone could do the header verification. That was my
missing puzzle piece, and now everything makes sense.
Thanks a lot for your patience!

Then I can also close the bug report now, because after the changes made by
our Keystone team the number of requests to Keystone dropped off a cliff.

Wish you all a happy weekend.

On Fri, 7 Nov 2025 at 11:57, Tobias Urdin - Binero <
[email protected]> wrote:

> Hello Boris,
>
> What you’re missing in your flow is accounting for the AWS v4 signature
> algorithm used for S3; you can look it up to understand it better.
>
> I will try to walk through the flow, but I will grossly oversimplify it
> so that it's easier to follow in bullet-list form.
>
> 1. The client prepares an HTTP request, computes a hash over multiple
> things contained in the request combined with the secret key (giving us:
> the signature) and adds that to the headers along with the access key.
>
> 2. RadosGW receives the request and sends the signature to Keystone's
> /v3/s3tokens endpoint, which verifies that the access key exists and that
> the signature is the same when it is computed with the secret key stored
> in Keystone.
>
> The request is now authenticated, Keystone says it's OK, but RadosGW has
> a problem: it doesn't know the secret key, so we cannot cache anything and
> must talk to Keystone on each request, so...
>
> 3. RadosGW does a request against Keystone's
> /v3/users/<user>/OS-EC2/<credential> endpoint; it knows the user because
> step #2 returns a Keystone token containing that info, and the credential
> is the access key. If this request is successful, RadosGW now has the
> secret key and adds it to the secret cache.
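The Keystone call in step #2 is a POST to /v3/s3tokens carrying roughly this payload (a simplified sketch: the exact field details can vary between versions, and all values below are made up):

```python
import base64
import json

# Made-up example values; in reality RadosGW forwards what the client sent.
access_key = "AKIAEXAMPLE"
signature = "0123abcd..."  # taken from the request's Authorization header
string_to_sign = "GET\n\n\nFri, 07 Nov 2025 10:00:00 GMT\n/bucket"

payload = json.dumps({
    "credentials": {
        "access": access_key,
        # The data the signature was computed over, base64-encoded, so
        # Keystone can recompute the signature with its stored secret.
        "token": base64.b64encode(string_to_sign.encode()).decode(),
        "signature": signature,
    }
})
```

Keystone recomputes the signature from "token" with the secret it has for "access" and compares it against "signature".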
>
> Now on the next request to RadosGW it will look up the secret key in the
> secret cache, keyed by the access key; we now have the secret key, and
> RadosGW itself can perform step #2 internally by computing the signature
> and verifying the request.
>
> Now if #3 fails, you end up in the state you were in: it does a request
> against Keystone on each request, and it cannot populate the secret cache.
>
> This is a good example of a chicken-and-egg problem with caching: we
> could fail the request because we could not cache it, but it's already
> authenticated; perhaps the Keystone error was temporary and we can try
> again on the next iteration. Or should we drop the request even though we
> could perform it? It's a question of semantics, i.e. how we should handle
> the result of the attempt to populate the secret cache. Both scenarios
> are valid, so maybe a config option :)
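The two policies being weighed above could look like this (the option name is purely hypothetical; no such RadosGW setting exists today):

```python
# Hypothetical knob, not an existing RadosGW option: what to do when the
# request is already authenticated but populating the secret cache failed.
reject_on_secret_cache_failure = False

def after_secret_fetch_failed():
    if reject_on_secret_cache_failure:
        # Fail closed: reject even though Keystone said the request is OK.
        return "reject"
    # Fail open (current behaviour): serve the request anyway and retry
    # populating the cache on the next request.
    return "serve"
```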
>
> Personally I would prefer to reject the request, but that is not what's
> done today. When the request in #3 fails, it should log an "s3 keystone:
> secret fetching error: %s" error in the RadosGW log if the config option
> debug_rgw >= 2.
>
> Hope that helps.
>
> /Tobias
>
> On 7 Nov 2025, at 10:07, Boris <[email protected]> wrote:
>
> What I still don't understand:
>
> If the request to get the EC2 credentials from Keystone ran into a 429,
> why did it work the first time?
> How does the Keystone authentication for RadosGW work under the hood? Is
> there documentation I can read up on?
>
> I would have thought it was something like:
> 1. Authenticate with Keystone using a special set of credentials
> (rgw_keystone_admin_user)
> 2. Fetch the EC2 credentials for the provided access key
> 3. Save those credentials for the lifetime that Keystone reported with
> the token
> 4. Do the normal S3 authentication with the cached credentials
>
> I would have thought that these tokens live in the memory of the RadosGW
> daemon and that every RGW daemon keeps track on its own.
> But as I am writing this down, I think: "What happens to the cached
> credentials if the EC2 credentials are invalidated by the end user?"
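One common answer to that revocation question is a TTL on cache entries, so a revoked secret can only be used for a bounded time after revocation. A minimal sketch (my illustration, not what RadosGW actually implements):

```python
import time

class TTLSecretCache:
    """Toy cache whose entries expire after `ttl` seconds, bounding how
    long a revoked secret can still authenticate requests."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._entries = {}  # access key -> (secret, expiry timestamp)

    def put(self, access_key, secret):
        self._entries[access_key] = (secret, time.monotonic() + self.ttl)

    def get(self, access_key):
        entry = self._entries.get(access_key)
        if entry is None:
            return None
        secret, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry aged out: forces a fresh lookup against Keystone,
            # which would then fail for a revoked credential.
            del self._entries[access_key]
            return None
        return secret

cache = TTLSecretCache(ttl=0.05)
cache.put("AKIAEXAMPLE", "topsecret")
```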
>
>
>
> btw: This is what our Keystone team told me:
> >Tobias is top notch; he knows both Keystone and Ceph. He has some
> >comments on the recent Keystone CVE we patched: launchpad.net/bugs/2119646
> >What he says makes sense. We already have the new policy that comes with
> >the OSSA patch on /v3/s3tokens, and you have it.
> >If you want, you can also tell him that we fixed the issue not by granting
> >admin but by granting /v3/users/<user>/OS-EC2/<credential> via a custom
> >role policy.
>
> On Thu, 6 Nov 2025 at 15:45, Boris <[email protected]> wrote:
>
>> Hi Tobias,
>>
>> thanks a lot for the in-depth explanation. The Keystone team fixed
>> something yesterday regarding the mentioned bug, and now the 429s are
>> gone. We still have no clue why RGW worked at all, and I am still trying
>> to understand it.
>>
>> Are you attending the Ceph Days in Berlin next week?
>>
>>
>>
>> On Thu, 6 Nov 2025 at 08:26, Tobias Urdin - Binero <
>> [email protected]> wrote:
>>
>>> Hello Boris,
>>>
>>> Then that is probably your issue. Ask the team maintaining OpenStack
>>> Keystone to check the logs for requests to that API endpoint that are
>>> failing with 403 on every request from RadosGW, similar to this:
>>>
>>>     "GET /v3/users/<user>/credentials/OS-EC2/<credential> HTTP/1.1" 403 140 "-" "-"
>>>
>>> What this means is that your authentication will work, but because
>>> RadosGW cannot retrieve the EC2 credential secret it will not populate
>>> the cache, and you will do authentication against Keystone on each
>>> request.
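The failing credential fetch is easy to reproduce by hand. Sketched below with made-up values (the real user id comes out of the Keystone token returned by /v3/s3tokens, and the X-Auth-Token is a token for the rgw_keystone_admin_user):

```python
from urllib.request import Request

# Illustrative values only.
keystone_url = "https://keystone.example.com"
user_id = "a13f0472be744104ad1f64bb2855cdee"
access_key = "05917cf2ee9d4fdea8baf6a3348ca33a"

req = Request(
    "%s/v3/users/%s/credentials/OS-EC2/%s" % (keystone_url, user_id, access_key),
    headers={"X-Auth-Token": "ADMIN_TOKEN"},  # rgw_keystone_admin_user's token
)
# If Keystone answers this with 403, RadosGW never learns the secret, the
# cache stays empty, and every request goes back to Keystone.
```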
>>>
>>> ---
>>>
>>> Let me try to clear things up a bit, hopefully. RadosGW needs to perform
>>> these API requests against Keystone:
>>>
>>> /v3/auth/tokens – No policy enforced on who can talk to this API
>>> (rgw_keystone_admin_user does not need any special role). Patches and
>>> backports [1] have been done to simply drop the admin token usage in
>>> this API request.
>>>
>>> /v3/s3tokens – No policy until this week due to the OSSA-2025-002 [2]
>>> security issue; this endpoint will now be enforced in future releases
>>> (including stable releases!) to require the admin or service role for
>>> the rgw_keystone_admin_user in Keystone [3].
>>>
>>> /v3/users/<user>/OS-EC2/<credential> – Policy enforcement to retrieve
>>> _other_ people's EC2 credentials requires the admin role (see
>>> identity:ec2_get_credential in [4]). I'm working on a proposal in
>>> Keystone [5] to make the policy allow both the admin and service roles.
>>> My proposal in [5] also includes the same changes for
>>> identity:get_credential due to a pending PR [6] that might change this
>>> API request.
>>>
>>> If my proposal [5] is merged, this would allow us to remove the admin
>>> role from the configured rgw_keystone_admin_user and only use the
>>> `service` role. The service role also has some elevated permissions and
>>> can do some damage, but it's at least not a complete admin on the
>>> entire cloud.
>>>
>>> Hope this helps.
>>>
>>> /Tobias
>>>
>>> [1] https://github.com/ceph/ceph/pull/60515
>>> [2] https://security.openstack.org/ossa/OSSA-2025-002.html
>>> [3] https://review.opendev.org/c/openstack/keystone/+/966069
>>> [4] https://docs.openstack.org/keystone/latest/configuration/policy.html
>>> [5] https://review.opendev.org/c/openstack/keystone/+/966189
>>> [6] https://github.com/ceph/ceph/pull/63283
>>>
>>> On 5 Nov 2025, at 17:31, Boris <[email protected]> wrote:
>>>
>>> Hi Tobias,
>>>
>>> I just pumped debug_rgw up to 20 and generated new output:
>>> https://pastebin.com/PcSUSWGY
>>> I hope that I redacted all the sensitive data. :)
>>>
>>> 3 requests to list all my buckets in <10 seconds.
>>> The 1st request showed me my buckets, the 2nd request resulted in a 500
>>> error, and the 3rd showed me my buckets again.
>>>
>>> To me this currently looks like I get a "429 Too Many Requests" from
>>> Keystone on all three requests that I made, and I would have expected
>>> to see this error only on the 2nd request.
>>> Lines 104-109 are also weird: I have no idea how the content of the
>>> /etc/hosts file made it into the log.
>>>
>>> The Keystone user that we have in "rgw_keystone_admin_user" is not a
>>> Keystone admin. The people that maintain Keystone just told me: "The
>>> user doesn't have admin and we would not grant it."
>>> "rgw_s3_auth_order" is the default, we didn't touch it: "sts, external,
>>> local"
>>>
>>>
>>> On Wed, 5 Nov 2025 at 16:32, Tobias Urdin - Binero <
>>> [email protected]> wrote:
>>>
>>>> Hello Boris,
>>>>
>>>> What roles are assigned to the Keystone user configured in
>>>> rgw_keystone_admin_user? It needs the admin role in order to be
>>>> allowed the /v3/users/<user_id>/credentials/OS-EC2/<access_key> API
>>>> request.
>>>>
>>>>     openstack role assignment list --names --user <rgw_keystone_admin_user value>
>>>>
>>>> Apart from that, I don't understand the "2nd request failed" part, as
>>>> that seems to be from the LocalEngine and is not related to Keystone.
>>>> If you have the default value for rgw_s3_auth_order, the only thing I
>>>> can think of is that there is a bug or you're missing some patch like
>>>> [1][2], but that's just a guess.
>>>>
>>>> /Tobias
>>>>
>>>> [1] https://github.com/ceph/ceph/pull/53846
>>>> [2] https://github.com/ceph/ceph/pull/53680
>>>>
>>>>
>>>> On 4 Nov 2025, at 11:32, Boris <[email protected]> wrote:
>>>>
>>>> I've created an upstream ticket: https://tracker.ceph.com/issues/73709
>>>>
>>>> On Mon, 3 Nov 2025 at 17:13, Boris <[email protected]> wrote:
>>>>
>>>> yes, via ceph orch.
>>>>
>>>> ---
>>>> service_type: rgw
>>>> service_id: eu-central-lz
>>>> service_name: rgw.eu-central-lz
>>>> placement:
>>>>   count_per_host: 1
>>>>   label: rgw
>>>> spec:
>>>>   config:
>>>>     debug_rgw: 0
>>>>     rgw_dns_name: s3.eu-central-lz.tld
>>>>     rgw_dns_s3website_name: s3-website.eu-central-lz.tld
>>>>     rgw_keystone_token_cache_size: 100000
>>>>     rgw_thread_pool_size: 512
>>>>   rgw_frontend_port: 7480
>>>>   rgw_frontend_type: beast
>>>>   rgw_realm: ovh
>>>>   rgw_zone: eu-central-lz
>>>>   rgw_zonegroup: eu-central-lz
>>>>
>>>> On Mon, 3 Nov 2025 at 17:09, Anthony D'Atri <
>>>> [email protected]> wrote:
>>>>
>>>> How is your RGW service deployed?  ceph orch?  Something else?
>>>>
>>>> On Nov 3, 2025, at 10:56 AM, Boris <[email protected]> wrote:
>>>>
>>>> Hi Anthony,
>>>> here are the config values we've set, along with their defaults. There
>>>> is no rgw_keystone_token_cache_ttl (it is neither in the documentation,
>>>> nor can I set it via ceph config set client.rgw
>>>> rgw_keystone_token_cache_ttl 3600):
>>>>
>>>> ~# ceph config show-with-defaults rgw.rgw1 | grep rgw_keystone | column -t
>>>> rgw_keystone_accepted_admin_roles            default
>>>> rgw_keystone_accepted_roles                  objectstore_operator  mon
>>>> rgw_keystone_admin_domain                    default               mon
>>>> rgw_keystone_admin_password                  yyyyyyyy              mon
>>>> rgw_keystone_admin_password_path             default
>>>> rgw_keystone_admin_project                   services              mon
>>>> rgw_keystone_admin_tenant                    default
>>>> rgw_keystone_admin_token                     default
>>>> rgw_keystone_admin_token_path                default
>>>> rgw_keystone_admin_user                      xxxxxxx               mon
>>>> rgw_keystone_api_version                     3                     mon
>>>> rgw_keystone_barbican_domain                 default
>>>> rgw_keystone_barbican_password               default
>>>> rgw_keystone_barbican_project                default
>>>> rgw_keystone_barbican_tenant                 default
>>>> rgw_keystone_barbican_user                   default
>>>> rgw_keystone_expired_token_cache_expiration  3600                  default
>>>> rgw_keystone_implicit_tenants                false                 default
>>>> rgw_keystone_service_token_accepted_roles    admin                 default
>>>> rgw_keystone_service_token_enabled           false                 default
>>>> rgw_keystone_token_cache_size                100000                mon      <-- I've set this to test if it solves the problem, but this is the default value
>>>> rgw_keystone_url                             https://auth.tld/     mon
>>>> rgw_keystone_verify_ssl                      true                  default
>>>>
>>>>
>>>>
>>>> On Mon, 3 Nov 2025 at 16:40, Anthony D'Atri <
>>>> [email protected]> wrote:
>>>>
>>>> Check the values of rgw_keystone_token_cache_size,
>>>> rgw_keystone_token_cache_ttl, and the other rgw_keystone options.
>>>>
>>>> I've seen at least one deployment tool that disabled Keystone caching
>>>> for dev purposes but leaked that into the release code, which deployed
>>>> RGW with Rook via a ConfigMap override.
>>>>
>>>>
>>>> On Nov 3, 2025, at 9:52 AM, Boris <[email protected]> wrote:
>>>>
>>>> Hi,
>>>> I am currently debugging a problem where the RadosGW Keystone token
>>>> cache seems not to work properly, or at all. I tried to debug it and
>>>> attached the log with debug_rgw set to 10. I've truncated it to only
>>>> show the part from "No stored secret string, cache miss" until the
>>>> request is done.
>>>>
>>>> The failed request hits a rate limit on Keystone, which currently
>>>> answers around 2k requests per minute.
>>>> Any ideas what I did wrong?
>>>>
>>>> * All requests were done within 10 seconds and were only an ls to show
>>>> buckets.
>>>> * This particular RGW only took my requests during testing.
>>>> * We didn't set any timeouts or special cache configs in Ceph.
>>>> * The system time is correct.
>>>>
>>>>
>>>> First request worked instantly:
>>>>
>>>> req 8122732607072897744 0.106001295s s3:list_buckets No stored secret string, cache miss
>>>> [4.0K blob data]
>>>> req 8122732607072897744 0.315003842s s3:list_buckets s3 keystone: validated token: 8144848695793469:user-9XGYcbFNUVTQ expires: 1762266594
>>>> req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee$a13f0472be744104ad1f64bb2855cdee : hit (negative entry)
>>>> req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
>>>> req 8122732607072897744 0.315003842s s3:list_buckets normalizing buckets and tenants
>>>> req 8122732607072897744 0.315003842s s->object=<NULL> s->bucket=
>>>> req 8122732607072897744 0.315003842s s3:list_buckets init permissions
>>>> req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
>>>> req 8122732607072897744 0.315003842s s3:list_buckets recalculating target
>>>> req 8122732607072897744 0.315003842s s3:list_buckets reading permissions
>>>> req 8122732607072897744 0.315003842s s3:list_buckets init op
>>>> req 8122732607072897744 0.315003842s s3:list_buckets verifying op mask
>>>> req 8122732607072897744 0.315003842s s3:list_buckets verifying op permissions
>>>> req 8122732607072897744 0.315003842s s3:list_buckets verifying op params
>>>> req 8122732607072897744 0.315003842s s3:list_buckets pre-executing
>>>> req 8122732607072897744 0.315003842s s3:list_buckets check rate limiting
>>>> req 8122732607072897744 0.315003842s s3:list_buckets executing
>>>> req 8122732607072897744 0.315003842s s3:list_buckets completing
>>>> req 8122732607072897744 0.315003842s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
>>>> req 8122732607072897744 0.315003842s s3:list_buckets op status=0
>>>> req 8122732607072897744 0.315003842s s3:list_buckets http status=200
>>>> ====== req done req=0x74659e51b6f0 op status=0 http_status=200 latency=0.315003842s ======
>>>>
>>>> 2nd request failed
>>>>
>>>> req 10422983006485317789 0.061000749s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.keys+05917cf2ee9d4fdea8baf6a3348ca33a : hit (negative entry)
>>>> req 10422983006485317789 0.061000749s s3:list_buckets error reading user info, uid=05917cf2ee9d4fdea8baf6a3348ca33a can't authenticate
>>>> req 10422983006485317789 0.061000749s s3:list_buckets Failed the auth strategy, reason=-5
>>>> failed to authorize request
>>>> WARNING: set_req_state_err err_no=5 resorting to 500
>>>> req 10422983006485317789 0.061000749s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
>>>> req 10422983006485317789 0.061000749s s3:list_buckets op status=0
>>>> req 10422983006485317789 0.061000749s s3:list_buckets http status=500
>>>> ====== req done req=0x74659e51b6f0 op status=0 http_status=500 latency=0.061000749s ======
>>>>
>>>> 3rd request went through again
>>>>
>>>> req 13123970335019889535 0.000000000s s3:list_buckets No stored secret string, cache miss
>>>> [250B blob data]
>>>> req 13123970335019889535 0.204002500s s3:list_buckets s3 keystone: validated token: 8144848695793469:user-9XGYcbFNUVTQ expires: 1762266602
>>>> req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee$a13f0472be744104ad1f64bb2855cdee : hit (negative entry)
>>>> req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
>>>> req 13123970335019889535 0.204002500s s3:list_buckets normalizing buckets and tenants
>>>> req 13123970335019889535 0.204002500s s->object=<NULL> s->bucket=
>>>> req 13123970335019889535 0.204002500s s3:list_buckets init permissions
>>>> req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
>>>> req 13123970335019889535 0.204002500s s3:list_buckets recalculating target
>>>> req 13123970335019889535 0.204002500s s3:list_buckets reading permissions
>>>> req 13123970335019889535 0.204002500s s3:list_buckets init op
>>>> req 13123970335019889535 0.204002500s s3:list_buckets verifying op mask
>>>> req 13123970335019889535 0.204002500s s3:list_buckets verifying op permissions
>>>> req 13123970335019889535 0.204002500s s3:list_buckets verifying op params
>>>> req 13123970335019889535 0.204002500s s3:list_buckets pre-executing
>>>> req 13123970335019889535 0.204002500s s3:list_buckets check rate limiting
>>>> req 13123970335019889535 0.204002500s s3:list_buckets executing
>>>> req 13123970335019889535 0.204002500s s3:list_buckets completing
>>>> req 13123970335019889535 0.204002500s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
>>>> req 13123970335019889535 0.204002500s s3:list_buckets op status=0
>>>> req 13123970335019889535 0.204002500s s3:list_buckets http status=200
>>>> ====== req done req=0x74659e51b6f0 op status=0 http_status=200 latency=0.204002500s ======
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> The "UTF-8-Probleme" self-help group will, exceptionally, meet in the large hall this time.
>>>> _______________________________________________
>>>> ceph-users mailing list -- [email protected]
>>>> To unsubscribe send an email to [email protected]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
>

