Looks like the issue was fixed in the latest Reef release (18.2.4). I found the following commit that seems to fix it: https://github.com/ceph/ceph/commit/26f1d6614bbc45a0079608718f191f94bd4eebb6
After upgrading, we also haven’t encountered the problem again.

Cheers,
Florian

> On 5. Aug 2024, at 14:38, Florian Schwab <fsch...@impossiblecloud.com> wrote:
>
> Hi Alex,
>
> thank you for the script. We will monitor how the queue fills up to see if
> this is the issue or not.
>
> Cheers,
> Florian
>
>> On 5. Aug 2024, at 14:01, Alex Hussein-Kershaw (HE/HIM)
>> <alex...@microsoft.com> wrote:
>>
>> Hi Florian,
>>
>> We are also gearing up to use persistent bucket notifications, but have not
>> got as far as you yet, so we are quite interested in this. As I understand
>> it, a bunch of new functionality is coming in Squid on the radosgw-admin
>> command to allow gathering metrics from the queues, but it is not available
>> yet in Reef.
>>
>> I've used this: parse-notifications.py (github.com)
>> <https://gist.github.com/yuvalif/b44a67b6278fe811aa38dd81a91eb3ba> to parse
>> all the objects in the queue; hopefully it helps you (credit to Yuval, who
>> wrote it). The reservation failure does look to me like the queue is full.
>> It would surely be interesting to see what is in the queue.
>>
>> Best wishes,
>> Alex
>>
>> From: Florian Schwab <fsch...@impossiblecloud.com>
>> Sent: Monday, August 5, 2024 11:02 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] RGW bucket notifications stop working after
>> a while and blocking requests
>>
>> Hi,
>>
>> we just set up 2 new Ceph clusters (using Rook). To do some processing of
>> the user activity, we configured a topic that sends events to Kafka.
>>
>> After 5-12 hours this stops working with a 503 SlowDown response:
>> debug 2024-08-02T09:17:58.205+0000 7ff4359ad700 1 req 13681579273117692719
>> 0.005000019s ERROR: failed to reserve notification on queue: private.rgw.
>> error: -28
>>
>> Our first thought was that the queue is full, but up to this point we see
>> messages coming into Kafka, and there is not much activity on the RGW itself
>> (only a few requests against the S3 API), so it can’t be a load issue.
>>
>> What helps is to remove the notification configuration on the buckets
>> (put-bucket-notification-configuration). If we directly re-add the previous
>> notification configuration, it continues working for a few hours before
>> failing again with the same error/behaviour.
>>
>> We haven’t been able to reproduce this if we disable persistence for the
>> topic, so it looks like it is related to the persistence option; otherwise
>> there would also be no queuing of events for sending to Kafka.
>> This also suggests that the issue is not with Kafka, which is what we
>> suspected first, e.g. that it can’t handle the amount of messages.
>>
>> Does anyone else have or had this issue and found the cause, or have a
>> suggestion on how to best continue debugging? Are there detailed metrics
>> etc. on the size and usage of the event queue?
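
For anyone matching their own logs against the line quoted above: Ceph conventionally logs errno values negated, and errno 28 is ENOSPC on Linux, which fits the queue-full reading of the reservation failure. Below is a minimal sketch for pulling the queue name and error out of such lines. It assumes the message wording stays exactly as in the single log line quoted in this thread; the regex and the function name parse_reserve_failure are illustrative and not part of Ceph.

```python
import errno
import os
import re

# Pattern for the RGW message quoted in this thread. The wording is taken from
# that one log line and may differ between Ceph versions (assumption).
RESERVE_FAILURE = re.compile(
    r"failed to reserve notification on queue: "
    r"(?P<queue>\S+?)\. error: (?P<code>-?\d+)"
)


def parse_reserve_failure(line):
    """Return (queue, errno_name, description), or None if the line does not match."""
    match = RESERVE_FAILURE.search(line)
    if match is None:
        return None
    code = abs(int(match.group("code")))  # the log shows the errno negated
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group("queue"), name, os.strerror(code)


if __name__ == "__main__":
    # The log line from this thread, joined back onto one line.
    sample = (
        "debug 2024-08-02T09:17:58.205+0000 7ff4359ad700 1 req "
        "13681579273117692719 0.005000019s ERROR: failed to reserve "
        "notification on queue: private.rgw. error: -28"
    )
    print(parse_reserve_failure(sample))
```

Counting matches per queue over time would give a rough view of when reservations start failing, as long as queue metrics are not available from radosgw-admin.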
>>
>> Here is the configuration for the topic and for a bucket:
>>
>> $ radosgw-admin topic list
>> {
>>     "topics": [
>>         {
>>             "user": "",
>>             "name": "private.rgw",
>>             "dest": {
>>                 "push_endpoint": "kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512",
>>                 "push_endpoint_args": "OpaqueData=&Version=2010-03-31&kafka-ack-level=broker&persistent=false&push-endpoint=kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512&use-ssl=true&verify-ssl=true",
>>                 "push_endpoint_topic": "private.rgw",
>>                 "stored_secret": true,
>>                 "persistent": true
>>             },
>>             "arn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>             "opaqueData": ""
>>         }
>>     ]
>> }
>>
>> $ aws s3api get-bucket-notification-configuration --bucket=XXX
>> {
>>     "TopicConfigurations": [
>>         {
>>             "Id": "my-id",
>>             "TopicArn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>             "Events": [
>>                 "s3:ObjectCreated:*",
>>                 "s3:ObjectRemoved:*"
>>             ]
>>         }
>>     ]
>> }
>>
>> Thank you for any input to solve this!
>>
>> Cheers,
>> Florian
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
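
The temporary workaround described in the original message (remove the bucket notification configuration, then re-add the previous one) could be scripted roughly as follows. This is a sketch under stated assumptions: `s3` is expected to be a boto3 S3 client pointed at the RGW endpoint, the helper name reset_bucket_notifications is made up for illustration, and it assumes RGW treats an empty configuration sent via put-bucket-notification-configuration as "remove all notifications", the way AWS S3 does. It only automates the workaround; the fix reported at the top of this thread was the upgrade to 18.2.4.

```python
def reset_bucket_notifications(s3, bucket):
    """Remove and then re-add the notification configuration of one bucket.

    `s3` is assumed to be a boto3 S3 client (or any object exposing the same
    two methods). Returns the configuration that was re-applied.
    """
    response = s3.get_bucket_notification_configuration(Bucket=bucket)
    # boto3 adds HTTP metadata to every response; it is not part of the
    # configuration and must not be sent back.
    config = {k: v for k, v in response.items() if k != "ResponseMetadata"}

    # Assumption: an empty configuration removes all notifications.
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration={}
    )
    # Re-add exactly what was configured before.
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=config
    )
    return config


# Example use (endpoint URL is a placeholder, bucket name as in the thread):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://rgw.example.invalid")
#   reset_bucket_notifications(s3, "XXX")
```

Passing the client in as a parameter keeps the helper free of endpoint and credential handling, and makes it easy to try against a stub before touching a real cluster.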