It looks like the issue was fixed in the latest Reef release (18.2.4).

I found the following commit that seems to fix it:
https://github.com/ceph/ceph/commit/26f1d6614bbc45a0079608718f191f94bd4eebb6
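
For anyone who wants to check whether a Reef gateway is already on a release
with the fix, here is a minimal sketch. It assumes that 18.2.4 is the first
Reef release containing the fix (as far as I can tell from the above) and that
the version string looks roughly like the output of "ceph --version". The
function names are made up for illustration, and other major releases are
deliberately reported as unknown.

```python
import re

# Assumption: 18.2.4 is the first Reef release with the fix, as reported in this thread.
FIXED_REEF = (18, 2, 4)


def parse_ceph_version(text):
    """Return (major, minor, patch) from a string such as
    'ceph version 18.2.4 (...) reef (stable)'."""
    match = re.search(r"\b(\d+)\.(\d+)\.(\d+)\b", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def reef_has_fix(text):
    """True or False for Reef (18.x) builds; None for any other major
    release, which this thread says nothing about."""
    version = parse_ceph_version(text)
    if version[0] != 18:
        return None
    return version >= FIXED_REEF
```

Calling reef_has_fix() on the version string reported by each RGW daemon
should be enough to spot gateways that still need the upgrade.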

After upgrading we also haven’t encountered the problem again.
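
To confirm that the error really is gone, the RGW logs can be scanned for the
reservation failure quoted in the original report below. This is only a
sketch: the regular expression is derived from that single log line, and the
function name is hypothetical. On Linux, errno 28 is ENOSPC ("No space left
on device"), which fits the "queue is full" reading of "error: -28".

```python
import errno
import re
from collections import Counter

# Pattern derived from the one log line quoted in this thread (an assumption:
# other Ceph versions may word the message differently).
PATTERN = re.compile(
    r"failed to reserve notification on queue: (?P<queue>\S+?)\. error: (?P<code>-?\d+)"
)


def reservation_failures(lines):
    """Count reservation failures per (queue name, errno name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            code = abs(int(match.group("code")))
            # Fall back to the raw number if the code has no symbolic name.
            name = errno.errorcode.get(code, str(code))
            counts[(match.group("queue"), name)] += 1
    return counts


sample = (
    "debug 2024-08-02T09:17:58.205+0000 7ff4359ad700 1 req 13681579273117692719 "
    "0.005000019s ERROR: failed to reserve notification on queue: private.rgw. error: -28"
)
print(reservation_failures([sample]))
```

In practice the function can be fed any iterable of log lines, for example
sys.stdin with the RGW pod logs piped in.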


Cheers,
Florian

> On 5. Aug 2024, at 14:38, Florian Schwab <fsch...@impossiblecloud.com> wrote:
> 
> Hi Alex,
> 
> thank you for the script. We will monitor how the queue fills up to see if 
> this is the issue or not.
> 
> 
> Cheers,
> Florian
> 
>> On 5. Aug 2024, at 14:01, Alex Hussein-Kershaw (HE/HIM) 
>> <alex...@microsoft.com> wrote:
>> 
>> Hi Florian,
>> 
>> We are also gearing up to use persistent bucket notifications, but have not 
>> got as far as you yet, so we are quite interested in this. As I understand 
>> it, a bunch of new functionality is coming in Squid for the radosgw-admin 
>> command to allow gathering metrics from the queues, but it is not available 
>> yet in Reef.
>> 
>> I've used parse-notifications.py 
>> <https://gist.github.com/yuvalif/b44a67b6278fe811aa38dd81a91eb3ba> to parse 
>> all the objects in the queue; hopefully it helps you (credit to Yuval, who 
>> wrote it). The reservation failure does look to me like the queue is full. 
>> It would certainly be interesting to see what is in the queue.
>> 
>> Best wishes,
>> Alex
>> 
>> From: Florian Schwab <fsch...@impossiblecloud.com>
>> Sent: Monday, August 5, 2024 11:02 AM
>> To: ceph-users@ceph.io
>> Subject: [EXTERNAL] [ceph-users] RGW bucket notifications stop working after 
>> a while and blocking requests
>>  
>> Hi,
>> 
>> we just set up 2 new ceph clusters (using rook). To do some processing of 
>> the user activity we configured a topic that sends events to Kafka.
>> 
>> After 5-12 hours this stops working with a 503 SlowDown response:
>> debug 2024-08-02T09:17:58.205+0000 7ff4359ad700 1 req 13681579273117692719 
>> 0.005000019s ERROR: failed to reserve notification on queue: private.rgw. 
>> error: -28
>> 
>> Our first thought was that the queue is full, but up to this point we see 
>> messages coming into Kafka, and there is not much activity on the RGW itself 
>> (only a few requests against the S3 API), so it can’t be a load issue.
>> 
>> What helps is to remove the notification configuration on the buckets 
>> (put-bucket-notification-configuration). If we directly re-add the previous 
>> notification configuration, it continues working for a few hours before 
>> failing again with the same error/behaviour.
>> 
>> We haven’t been able to reproduce this if we disable persistence for the 
>> topic, so it looks like it is related to the persistence option - otherwise 
>> there would also be no queuing of events for sending to Kafka.
>> This also suggests that the issue is not with Kafka, which is what we 
>> suspected first (e.g. that it can’t handle the volume of messages).
>> 
>> Has anyone else had this issue and found the cause, or does anyone have a 
>> suggestion on how best to continue debugging? Are there detailed metrics 
>> etc. on the size and usage of the event queue?
>> 
>> 
>> Here is the configuration for the topic and for a bucket:
>> 
>> $ radosgw-admin topic list
>> {
>>    "topics": [
>>        {
>>            "user": "",
>>            "name": "private.rgw",
>>            "dest": {
>>                "push_endpoint": 
>> "kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512",
>>                "push_endpoint_args": 
>> "OpaqueData=&Version=2010-03-31&kafka-ack-level=broker&persistent=false&push-endpoint=kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512&use-ssl=true&verify-ssl=true",
>>                "push_endpoint_topic": "private.rgw",
>>                "stored_secret": true,
>>                "persistent": true
>>            },
>>            "arn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>            "opaqueData": ""
>>        }
>>    ]
>> }
>> 
>> $ aws s3api get-bucket-notification-configuration --bucket=XXX
>> {
>>    "TopicConfigurations": [
>>        {
>>            "Id": "my-id",
>>            "TopicArn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>            "Events": [
>>                "s3:ObjectCreated:*",
>>                "s3:ObjectRemoved:*"
>>            ]
>>        }
>>    ]
>> }
>> 
>> 
>> Thank you for any input to solve this!
>> 
>> 
>> Cheers,
>> Florian
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> <mailto:ceph-users-le...@ceph.io>
