Hello Boris,

What your flow is missing is the AWS Signature Version 4 (SigV4) algorithm
used for S3; it's worth looking it up to understand the details.

I will try to walk through the flow but will grossly oversimplify it so that
it’s easier to follow in bullet list form.

1. The client prepares an HTTP request and computes a hash over multiple parts
of the request combined with the secret key (giving us the signature), then
adds that to the headers along with the access key.

2. RadosGW receives the request and sends the signature to Keystone's
/v3/s3tokens endpoint, which verifies that the access key exists and that the
signature matches the one computed with the secret key stored in Keystone.

The request is now authenticated and Keystone says it's OK, but RadosGW has
a problem: it doesn't know the secret key, so it cannot cache anything and
must talk to Keystone on each request. So...

3. RadosGW makes a request against Keystone's /v3/users/<user>/OS-EC2/<credential>
endpoint. It knows the user because step #2 returns a Keystone token containing
that info, and the credential is the access key. If this request succeeds,
RadosGW now has the secret key and adds it to the secret cache.

Now, on the next request, RadosGW looks up the secret key in the secret cache,
keyed by the access key. With the secret key in hand, RadosGW can perform
step #2 internally by computing the signature itself and verifying the request.
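As a sketch of the signature computed in step #1 (and recomputed locally by RadosGW once the secret cache is warm), here is the SigV4 signing-key derivation in Python. The canonical-request and string-to-sign construction is left out, and the helper names are my own:

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def sigv4_signature(secret_key: str, date: str, region: str,
                    string_to_sign: str) -> str:
    """Derive the SigV4 signing key and sign the "string to sign".

    date is YYYYMMDD; string_to_sign embeds a SHA-256 hash of the
    canonical request (method, path, headers, payload hash), which is
    the "hash of multiple things contained in the request" from step #1.
    """
    # Signing key: a chain of HMAC-SHA256 operations seeded with the secret.
    k_date = _hmac(("AWS4" + secret_key).encode(), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, "s3")
    k_signing = _hmac(k_service, "aws4_request")
    # Final signature: hex HMAC of the string to sign.
    return hmac.new(k_signing, string_to_sign.encode(), hashlib.sha256).hexdigest()
```

The client sends the hex signature plus the access key in the Authorization header; anyone holding the same secret key can rerun this chain and compare the result, which is exactly what Keystone does in step #2, or what RadosGW does itself once the secret cache is populated.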

Now if #3 fails, you end up in the state you were in: RadosGW talks to
Keystone on every request and cannot populate the secret cache.

This is a good example of a chicken-and-egg problem with caching. We could fail
the request because we could not cache the secret, but the request is already
authenticated; perhaps the Keystone error was temporary and we can try again on
the next request. Or should we drop the request even though we could serve it?
It's a question of semantics: how should we handle the result of the attempt to
populate the secret cache? Both scenarios are valid, so maybe a config option :)
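To make the two options concrete, here is a made-up sketch in Python; SecretCache, fetch_ec2_secret and reject_on_fetch_error are all invented names for illustration, not RadosGW internals:

```python
class SecretCache:
    """In-memory map of access key -> secret key, per RadosGW daemon."""

    def __init__(self):
        self._secrets = {}

    def get(self, access_key):
        return self._secrets.get(access_key)

    def put(self, access_key, secret_key):
        self._secrets[access_key] = secret_key


def authenticate(cache, access_key, fetch_ec2_secret, validate_via_s3tokens,
                 reject_on_fetch_error=False):
    """On a cache miss, validate via Keystone /v3/s3tokens (step #2), then
    try to populate the cache from /v3/users/<user>/OS-EC2/<credential>
    (step #3).

    reject_on_fetch_error models the "maybe a config option" choice:
      False -> serve the already-authenticated request anyway
      True  -> drop it because the cache could not be populated
    """
    if cache.get(access_key) is not None:
        return True  # step #2 can be done locally with the cached secret

    if not validate_via_s3tokens(access_key):
        return False  # signature did not check out

    try:
        cache.put(access_key, fetch_ec2_secret(access_key))
    except Exception:
        # Chicken-and-egg: the request is authenticated, but the cache
        # stays cold and the next request must hit Keystone again.
        return not reject_on_fetch_error
    return True
```

With reject_on_fetch_error=False you get today's behaviour (serve, stay uncached); with True you get the stricter policy I would personally prefer.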

Personally I would prefer to reject the request, but that is not what's done
today. When the request in #3 fails, RadosGW should log an
"s3 keystone: secret fetching error: %s" error in its log if the config
option debug_rgw >= 2.

Hope that helps.

/Tobias

On 7 Nov 2025, at 10:07, Boris <[email protected]> wrote:

What I still don't understand:

If the request to get the EC2 credentials from Keystone ran into a 429, why did
it work the first time?
How does the Keystone authentication work for radosgw under the hood? Is there
documentation I can read up on?

I would have thought it was something like:
1. Authenticate with Keystone using a special set of credentials (rgw_keystone_admin_user)
2. Fetch the EC2 credentials for the provided access key
3. Save those credentials for the duration Keystone indicated via the token lifetime
4. Do the normal S3 authentication with the cached credentials

I would have thought that these tokens live in the memory of the radosgw daemon
and every RGW daemon keeps track on its own.
But as I am writing this down I think: "what happens to the cached credentials
if the EC2 credentials are invalidated by the end user?"



btw: This is what our keystone team told me:
>Tobias is top notch, he knows both keystone and ceph, he has some comments on 
>the recent keystone cve we patched: 
>https://launchpad.net/bugs/2119646
>What he says makes sense. We already have the new policy that comes with the 
>OSSA patch on /v3/s3tokens and you have it.
>If you want you can also tell him that we fixed the issue not by granting 
>admin but by granting /v3/users/<user>/OS-EC2/<credential> via a custom role 
>policy.

On Thu, 6 Nov 2025 at 15:45, Boris <[email protected]> wrote:
Hi Tobias,

thanks a lot for the in-depth explanation. The Keystone team fixed something
yesterday regarding the mentioned bug and now the 429s are gone.
We still have no clue why RGW worked at all, and I still try to understand it.

Do you attend the ceph days in Berlin next week?



On Thu, 6 Nov 2025 at 08:26, Tobias Urdin - Binero <[email protected]> wrote:
Hello Boris,

Then that is probably your issue. Ask the team maintaining OpenStack Keystone
to check the logs for requests to that API endpoint failing with 403 on every
request from RadosGW, similar to this:

    "GET /v3/users/<user>/credentials/OS-EC2/<credential> HTTP/1.1" 403 140 "-" "-"

What this means is that your authentication works, but because RadosGW cannot
retrieve the EC2 credential secret it will not populate the cache, and you
will do authentication against Keystone on each request.

---

Let me try to clear things up a bit, hopefully. RadosGW needs to perform these 
API requests against Keystone:

/v3/auth/tokens – No policy enforced on who can talk to this API
(rgw_keystone_admin_user does not need any special role). Patches and
backports [1] have been done to simply drop the admin token usage in this API
request.

/v3/s3tokens – No policy until this week due to the OSSA-2025-002 [2] security
issue; this endpoint will now be enforced in future releases (including stable
releases!) to require the admin or service role for the rgw_keystone_admin_user
in Keystone [3].

/v3/users/<user>/OS-EC2/<credential> – Policy enforcement for retrieving
_other_ people's EC2 credentials says this must have the admin role (see
identity:ec2_get_credential in [4]). I'm working on a proposal in Keystone [5]
to make the policy allow both the admin and service roles. My proposal in [5]
also includes the same change for identity:get_credential due to a pending
PR [6] that might change this API request.

If my proposal [5] is merged, this would allow us to remove the admin role from
the configured rgw_keystone_admin_user and only use the `service` role. The
service role also has some elevated permissions and can do some damage, but
it's at least not a complete admin on the entire cloud.
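For illustration, a deployment-side Keystone policy.yaml override granting the
service role (or a custom role) access could look something like the following;
the rule strings here are my assumption, so check the policy reference [4] for
the defaults in your release before copying anything:

```yaml
# Hypothetical policy.yaml override: allow credential reads for the
# service role instead of admin only (rule strings are assumptions).
"identity:ec2_get_credential": "role:admin or role:service"
"identity:get_credential": "role:admin or role:service"
```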

Hope this helps.

/Tobias

[1] https://github.com/ceph/ceph/pull/60515
[2] https://security.openstack.org/ossa/OSSA-2025-002.html
[3] https://review.opendev.org/c/openstack/keystone/+/966069
[4] https://docs.openstack.org/keystone/latest/configuration/policy.html
[5] https://review.opendev.org/c/openstack/keystone/+/966189
[6] https://github.com/ceph/ceph/pull/63283

On 5 Nov 2025, at 17:31, Boris <[email protected]> wrote:

Hi Tobias,

I just pumped up the rgw_debug to 20 and generated new output: 
https://pastebin.com/PcSUSWGY
I hope that I redacted all the sensitive data. :)

3 requests to list all my buckets in <10 seconds.
The 1st request showed me my buckets, the 2nd request resulted in a 500 error,
and the 3rd showed me my buckets again.

For me this currently looks like I get a "429 Too Many Requests" from Keystone
on all three requests that I made, and I would have expected to see this error
only on the 2nd request.
Weird are also lines 104-109. I have no idea how the content of the /etc/hosts
file made it into the log.

The Keystone user that we have in "rgw_keystone_admin_user" is not a Keystone
admin. The people that maintain Keystone just told me: "The user doesn't have
admin and we would not grant it."
The "rgw_s3_auth_order" is default; we didn't touch it: "sts, external, local"


On Wed, 5 Nov 2025 at 16:32, Tobias Urdin - Binero <[email protected]> wrote:
Hello Boris,

What roles are assigned to the Keystone user configured in
rgw_keystone_admin_user? It needs the admin role in order to be allowed the
/v3/users/<user_id>/credentials/OS-EC2/<access_key> API request.

    openstack role assignment list --names --user <rgw_keystone_admin_user value>

Apart from that, I don't understand the "2nd request failed" part, as that
seems to come from the LocalEngine and is not related to Keystone. If you have
the default value for rgw_s3_auth_order, the only thing I can think of is that
there is a bug or you're missing some patch like [1] [2], but that's just a
guess.

/Tobias

[1] https://github.com/ceph/ceph/pull/53846
[2] https://github.com/ceph/ceph/pull/53680


On 4 Nov 2025, at 11:32, Boris <[email protected]> wrote:

I've created an upstream ticket https://tracker.ceph.com/issues/73709

On Mon, 3 Nov 2025 at 17:13, Boris <[email protected]> wrote:

yes, via ceph orch.

---
service_type: rgw
service_id: eu-central-lz
service_name: rgw.eu-central-lz
placement:
 count_per_host: 1
 label: rgw
spec:
 config:
   debug_rgw: 0
   rgw_dns_name: s3.eu-central-lz.tld
   rgw_dns_s3website_name: s3-website.eu-central-lz.tld
   rgw_keystone_token_cache_size: 100000
   rgw_thread_pool_size: 512
 rgw_frontend_port: 7480
 rgw_frontend_type: beast
 rgw_realm: ovh
 rgw_zone: eu-central-lz
 rgw_zonegroup: eu-central-lz

On Mon, 3 Nov 2025 at 17:09, Anthony D'Atri <[email protected]> wrote:

How is your RGW service deployed?  ceph orch?  Something else?

On Nov 3, 2025, at 10:56 AM, Boris <[email protected]> wrote:

Hi Anthony,
here are the config values, the ones we've set along with their defaults.
There is no rgw_keystone_token_cache_ttl (it's neither in the documentation,
nor can I set it via `ceph config set client.rgw rgw_keystone_token_cache_ttl 3600`):

~# ceph config show-with-defaults rgw.rgw1 | grep rgw_keystone | column -t
rgw_keystone_accepted_admin_roles            default
rgw_keystone_accepted_roles                  objectstore_operator  mon
rgw_keystone_admin_domain                    default               mon
rgw_keystone_admin_password                  yyyyyyyy              mon
rgw_keystone_admin_password_path             default
rgw_keystone_admin_project                   services              mon
rgw_keystone_admin_tenant                    default
rgw_keystone_admin_token                     default
rgw_keystone_admin_token_path                default
rgw_keystone_admin_user                      xxxxxxx               mon
rgw_keystone_api_version                     3                     mon
rgw_keystone_barbican_domain                 default
rgw_keystone_barbican_password               default
rgw_keystone_barbican_project                default
rgw_keystone_barbican_tenant                 default
rgw_keystone_barbican_user                   default
rgw_keystone_expired_token_cache_expiration  3600                  default
rgw_keystone_implicit_tenants                false                 default
rgw_keystone_service_token_accepted_roles    admin                 default
rgw_keystone_service_token_enabled           false                 default
rgw_keystone_token_cache_size                100000                mon       <-- I've set this to test if this solves the problem, but this is the default value
rgw_keystone_url                             https://auth.tld/     mon
rgw_keystone_verify_ssl                      true                  default



On Mon, 3 Nov 2025 at 16:40, Anthony D'Atri <[email protected]> wrote:

Check the values of rgw_keystone_token_cache_size and
rgw_keystone_token_cache_ttl and the other rgw_keystone options.

I've seen at least one deployment tool that disabled Keystone caching for dev
purposes but leaked that into the release code; it deployed RGW via Rook with
a configmap override.


On Nov 3, 2025, at 9:52 AM, Boris <[email protected]> wrote:

Hi,
I am currently debugging a problem where the radosgw Keystone token cache
seems not to work properly, or at all. I tried to debug it and attached the
rgw_debug log set to 10. I've truncated it to only show the part from "No
stored secret string, cache miss" until the request is done.

The failed request hits a rate limit on Keystone, which currently takes
around 2k answered requests per minute.
Any ideas what I did wrong?

* All requests were done within 10 seconds and were only an ls to show buckets.
* This particular RGW only took my requests during testing.
* We didn't set any timeouts or special cache configs in Ceph.
* System time is correct.


First request worked instantly:

req 8122732607072897744 0.106001295s s3:list_buckets No stored secret string, cache miss
[4.0K blob data]
req 8122732607072897744 0.315003842s s3:list_buckets s3 keystone: validated token: 8144848695793469:user-9XGYcbFNUVTQ expires: 1762266594
req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee$a13f0472be744104ad1f64bb2855cdee : hit (negative entry)
req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
req 8122732607072897744 0.315003842s s3:list_buckets normalizing buckets and tenants
req 8122732607072897744 0.315003842s s->object=<NULL> s->bucket=
req 8122732607072897744 0.315003842s s3:list_buckets init permissions
req 8122732607072897744 0.315003842s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
req 8122732607072897744 0.315003842s s3:list_buckets recalculating target
req 8122732607072897744 0.315003842s s3:list_buckets reading permissions
req 8122732607072897744 0.315003842s s3:list_buckets init op
req 8122732607072897744 0.315003842s s3:list_buckets verifying op mask
req 8122732607072897744 0.315003842s s3:list_buckets verifying op permissions
req 8122732607072897744 0.315003842s s3:list_buckets verifying op params
req 8122732607072897744 0.315003842s s3:list_buckets pre-executing
req 8122732607072897744 0.315003842s s3:list_buckets check rate limiting
req 8122732607072897744 0.315003842s s3:list_buckets executing
req 8122732607072897744 0.315003842s s3:list_buckets completing
req 8122732607072897744 0.315003842s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
req 8122732607072897744 0.315003842s s3:list_buckets op status=0
req 8122732607072897744 0.315003842s s3:list_buckets http status=200
====== req done req=0x74659e51b6f0 op status=0 http_status=200 latency=0.315003842s ======

2nd request failed:

req 10422983006485317789 0.061000749s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.keys+05917cf2ee9d4fdea8baf6a3348ca33a : hit (negative entry)
req 10422983006485317789 0.061000749s s3:list_buckets error reading user info, uid=05917cf2ee9d4fdea8baf6a3348ca33a can't authenticate
req 10422983006485317789 0.061000749s s3:list_buckets Failed the auth strategy, reason=-5
failed to authorize request
WARNING: set_req_state_err err_no=5 resorting to 500
req 10422983006485317789 0.061000749s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
req 10422983006485317789 0.061000749s s3:list_buckets op status=0
req 10422983006485317789 0.061000749s s3:list_buckets http status=500
====== req done req=0x74659e51b6f0 op status=0 http_status=500 latency=0.061000749s ======

3rd request went through again:

req 13123970335019889535 0.000000000s s3:list_buckets No stored secret string, cache miss
[250B blob data]
req 13123970335019889535 0.204002500s s3:list_buckets s3 keystone: validated token: 8144848695793469:user-9XGYcbFNUVTQ expires: 1762266602
req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee$a13f0472be744104ad1f64bb2855cdee : hit (negative entry)
req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
req 13123970335019889535 0.204002500s s3:list_buckets normalizing buckets and tenants
req 13123970335019889535 0.204002500s s->object=<NULL> s->bucket=
req 13123970335019889535 0.204002500s s3:list_buckets init permissions
req 13123970335019889535 0.204002500s s3:list_buckets cache get: name=eu-central-lz.rgw.meta+users.uid+a13f0472be744104ad1f64bb2855cdee : hit (requested=0x13, cached=0x13)
req 13123970335019889535 0.204002500s s3:list_buckets recalculating target
req 13123970335019889535 0.204002500s s3:list_buckets reading permissions
req 13123970335019889535 0.204002500s s3:list_buckets init op
req 13123970335019889535 0.204002500s s3:list_buckets verifying op mask
req 13123970335019889535 0.204002500s s3:list_buckets verifying op permissions
req 13123970335019889535 0.204002500s s3:list_buckets verifying op params
req 13123970335019889535 0.204002500s s3:list_buckets pre-executing
req 13123970335019889535 0.204002500s s3:list_buckets check rate limiting
req 13123970335019889535 0.204002500s s3:list_buckets executing
req 13123970335019889535 0.204002500s s3:list_buckets completing
req 13123970335019889535 0.204002500s cache get: name=eu-central-lz.rgw.log++script.postrequest. : hit (negative entry)
req 13123970335019889535 0.204002500s s3:list_buckets op status=0
req 13123970335019889535 0.204002500s s3:list_buckets http status=200
====== req done req=0x74659e51b6f0 op status=0 http_status=200 latency=0.204002500s ======





--
The "UTF-8 problems" self-help group will, as an exception, meet in the groüen hall.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]



















