Hi,

Looking at that tracker, I see that you were hitting some actual errors there; I don't see any errors that stand out in our logs. To dig further, I turned the RGW debug logging up to 20.
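For reference, I raised the log level on the running daemon through its admin socket, roughly like this (same asok name as the objecter_requests commands quoted below; I set it back to the 1/5 default once I had captured enough):

  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 config set debug_rgw 20
  ... capture the listing activity ...
  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 config set debug_rgw 1/5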
With debug_rgw at 20, I see the following types of log lines repeated over and over:

> ...
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.36"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.39
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.39"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.5
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.5"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 10 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: end of outer loop, truncated=1, count=0, attempt=4829
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: starting attempt 4830
> 2025-01-27T19:36:25.989+0000 7f657bb17640 10 req 3636357367581543624 4066.843017578s s3:list_bucket cls_bucket_list_ordered: request from each of 149 shard(s) for 1001 entries to get 1001 total entries
> ...
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.6"
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.7
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.7"
> 2025-01-27T19:40:53.704+0000 7f64a716e640 10 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: end of outer loop, truncated=1, count=0, attempt=9134
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: starting attempt 9135
> 2025-01-27T19:40:53.704+0000 7f64a716e640 10 req 8762048560071224804 5187.776367188s s3:list_bucket cls_bucket_list_ordered: request from each of 149 shard(s) for 1001 entries to get 1001 total entries
> ...
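The attempt counters in those lines are already in the thousands (4829 and 9134 above). For a rough count of how many listing attempts a single request has made, grepping the "starting attempt" lines for one req id works, something along these lines (our RGW logs end up in the container log, so the container name below is only a placeholder; adjust for wherever your logs actually go):

  podman logs <rgw-container> 2>&1 | grep -c 'req 3636357367581543624 .*starting attempt'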
Later in the same request (still req 3636357367581543624), the cls_bucket_list_ordered side looks like this:

> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: currently processing _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.19 from shard 8
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: got _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.19
> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: currently processing _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.2 from shard 8
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: got _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.2
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: stopped accumulating results at count=1001, dirent="", because its shard is truncated and exhausted
> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: returning, count=1001, is_truncated=1

These logs make it look like a single request is attempting to list a bucket thousands of times, and the 5187 s of elapsed request time on one of them works out to roughly 86 minutes. I only checked one radosgw out of the six, but I suspect they are all doing this in the background, and I'm assuming this isn't normal. Can anybody shed some light on what these log lines mean?

Also, is there a way to link req 3636357367581543624 to the entries logged at lower debug levels? I thought I would just need to convert the id to hex, but that doesn't line up with the started/completed request lines.
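In case I'm missing something obvious, this is roughly the conversion I tried, just a plain hex rendering of the decimal req id:

  [root@rgw1 ~]# printf '%x\n' 3636357367581543624
  3276efc273128cc8

Separately, for anyone who wants to compare on their own gateways: filtering the same objecter_requests dump shown in my earlier mail (quoted below) should give a count of the in-flight bucket-index listing calls, with something like this (the jq expression is only a sketch):

  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '[.ops[] | select(any(.osd_ops[]; contains("rgw.bucket_list")))] | length'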
Thanks,
Reid

On Mon, Jan 27, 2025 at 11:36 AM Joshua Baergen <jbaer...@digitalocean.com> wrote:

> Hey Reid,
>
> This sounds similar to what we saw in
> https://tracker.ceph.com/issues/62256, in case that helps with your
> investigation.
>
> Josh
>
> On Mon, Jan 27, 2025 at 8:07 AM Reid Guyett <reid.guy...@gmail.com> wrote:
> >
> > Hello,
> >
> > We are experiencing slowdowns on one of our radosgw clusters. We restart the radosgw daemons every 2 hours, and things start getting slow after an hour and a half: the average get/put latencies go from 20ms/400ms to 1s/5s+ according to our metrics. When I stop traffic to one of the radosgw daemons by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions to that daemon, I still see objecter_requests going up and down 15 minutes later. linger_ops stays at a constant 10 the whole time.
> >
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 46
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 211
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 427
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 26
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 190
> >
> > The requests say they are "call rgw.bucket_list in=234b" and mostly reference ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N", where N is 1-149 (the shard count for the bucket).
> >
> > > {
> > >     "tid": 2583119,
> > >     "pg": "8.579482b0",
> > >     "osd": 208,
> > >     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> > >     "object_locator": "@8",
> > >     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> > >     "target_object_locator": "@8",
> > >     "paused": 0,
> > >     "used_replica": 0,
> > >     "precalc_pgid": 0,
> > >     "last_sent": "1796975.496664s",
> > >     "age": 0.350012616,
> > >     "attempts": 1,
> > >     "snapid": "head",
> > >     "snap_context": "0=[]",
> > >     "mtime": "1970-01-01T00:00:00.000000+0000",
> > >     "osd_ops": [
> > >         "call rgw.bucket_list in=234b"
> > >     ]
> > > },
> >
> > I don't think it should be other rgw processes, because we have this daemon's ceph.conf set to disable the other threads:
> >
> > > rgw enable gc threads = false
> > > rgw enable lc threads = false
> > > rgw dynamic resharding = false
> >
> > When I restart the service while it is still in a DRAINED state in HAProxy, checking objecter_requests yields 0 even a few minutes after it has come back up.
> >
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> >
> > Any thoughts on why these ops appear to be stuck/recurring until the daemon is restarted? I think this is related to our performance issues, but I don't know what the fix is.
> >
> > The rgws are 18.2.4, running as containers in Podman on Debian 11. Our other clusters do not exhibit this behavior.
> >
> > Thanks!
> >
> > Reid
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io