Hi,

Looking at that tracker, I see that you were hitting some actual errors there; I don't see any errors that stand out in our logs. To dig further, I turned the RGW debug logging up to 20.
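For reference, I raised the log level on the running daemon through its admin socket, roughly like this (same asok name as the objecter_requests commands quoted below; I set it back to the 1/5 default once I had captured enough):

  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 config set debug_rgw 20
  ... capture the listing activity ...
  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 config set debug_rgw 1/5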
With debug_rgw at 20, I see the following types of log lines repeated over and over:

> ...
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.36"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.39
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.39"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.5
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME>2~a9lvduIp8rv3_btbiGEfOgAC-EvF85n.5"
> 2025-01-27T19:36:25.989+0000 7f657bb17640 10 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: end of outer loop, truncated=1, count=0, attempt=4829
> 2025-01-27T19:36:25.989+0000 7f657bb17640 20 req 3636357367581543624 4066.843017578s s3:list_bucket list_objects_ordered: starting attempt 4830
> 2025-01-27T19:36:25.989+0000 7f657bb17640 10 req 3636357367581543624 4066.843017578s s3:list_bucket cls_bucket_list_ordered: request from each of 149 shard(s) for 1001 entries to get 1001 total entries
> ...
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.6"
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: considering entry _multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.7
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: skipping past namespaced objects, including "_multipart_Data/<OBJECTNAME2>2~XSOCVgidHCMfywQ8-Z3kestTOwfqJ--.7"
> 2025-01-27T19:40:53.704+0000 7f64a716e640 10 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: end of outer loop, truncated=1, count=0, attempt=9134
> 2025-01-27T19:40:53.704+0000 7f64a716e640 20 req 8762048560071224804 5187.776367188s s3:list_bucket list_objects_ordered: starting attempt 9135
> 2025-01-27T19:40:53.704+0000 7f64a716e640 10 req 8762048560071224804 5187.776367188s s3:list_bucket cls_bucket_list_ordered: request from each of 149 shard(s) for 1001 entries to get 1001 total entries
> ...
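The attempt counters in those lines are already in the thousands (4829 and 9134 above). For a rough count of how many listing attempts a single request has made, grepping the "starting attempt" lines for one req id works, something along these lines (our RGW logs end up in the container log, so the container name below is only a placeholder; adjust for wherever your logs actually go):

  podman logs <rgw-container> 2>&1 | grep -c 'req 3636357367581543624 .*starting attempt'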
Later in the same request (still req 3636357367581543624), the cls_bucket_list_ordered side looks like this:

> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: currently processing _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.19 from shard 8
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: got _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.19
> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: currently processing _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.2 from shard 8
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: got _multipart_Data/<OBJECTNAME3>.2~r5PVFujVZq_hkwZea-hrYgE13OI1MXP.2
> 2025-01-27T19:40:53.093+0000 7f657bb17640 10 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: stopped accumulating results at count=1001, dirent="", because its shard is truncated and exhausted
> 2025-01-27T19:40:53.093+0000 7f657bb17640 20 req 3636357367581543624 4333.946777344s s3:list_bucket cls_bucket_list_ordered: returning, count=1001, is_truncated=1

These logs make it look like a single request is attempting to list a bucket thousands of times, and the 5187 s of elapsed request time on one of them works out to roughly 86 minutes. I only checked one radosgw out of the six, but I suspect they are all doing this in the background, and I'm assuming this isn't normal. Can anybody shed some light on what these log lines mean?

Also, is there a way to link req 3636357367581543624 to the entries logged at lower debug levels? I thought I would just need to convert the id to hex, but that doesn't line up with the started/completed request lines.
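In case I'm missing something obvious, this is roughly the conversion I tried, just a plain hex rendering of the decimal req id:

  [root@rgw1 ~]# printf '%x\n' 3636357367581543624
  3276efc273128cc8

Separately, for anyone who wants to compare on their own gateways: filtering the same objecter_requests dump shown in my earlier mail (quoted below) should give a count of the in-flight bucket-index listing calls, with something like this (the jq expression is only a sketch):

  [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '[.ops[] | select(any(.osd_ops[]; contains("rgw.bucket_list")))] | length'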
Thanks,
Reid

On Mon, Jan 27, 2025 at 11:36 AM Joshua Baergen <jbaer...@digitalocean.com> wrote:

> Hey Reid,
>
> This sounds similar to what we saw in
> https://tracker.ceph.com/issues/62256, in case that helps with your
> investigation.
>
> Josh
>
> On Mon, Jan 27, 2025 at 8:07 AM Reid Guyett <reid.guy...@gmail.com> wrote:
> >
> > Hello,
> >
> > We are experiencing slowdowns on one of our radosgw clusters. We restart the radosgw daemons every 2 hours, and things start getting slow after an hour and a half: the average get/put latencies go from 20ms/400ms to 1s/5s+ according to our metrics. When I stop traffic to one of the radosgw daemons by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions to that daemon, I still see objecter_requests going up and down 15 minutes later. linger_ops stays at a constant 10 the whole time.
> >
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 46
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 211
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 427
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 26
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 190
> >
> > The requests say they are "call rgw.bucket_list in=234b" and mostly reference ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N", where N is 1-149 (the shard count for the bucket).
> >
> > > {
> > >     "tid": 2583119,
> > >     "pg": "8.579482b0",
> > >     "osd": 208,
> > >     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> > >     "object_locator": "@8",
> > >     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> > >     "target_object_locator": "@8",
> > >     "paused": 0,
> > >     "used_replica": 0,
> > >     "precalc_pgid": 0,
> > >     "last_sent": "1796975.496664s",
> > >     "age": 0.350012616,
> > >     "attempts": 1,
> > >     "snapid": "head",
> > >     "snap_context": "0=[]",
> > >     "mtime": "1970-01-01T00:00:00.000000+0000",
> > >     "osd_ops": [
> > >         "call rgw.bucket_list in=234b"
> > >     ]
> > > },
> >
> > I don't think it should be other rgw processes, because we have this daemon's ceph.conf set to disable the other threads:
> >
> > > rgw enable gc threads = false
> > > rgw enable lc threads = false
> > > rgw dynamic resharding = false
> >
> > When I restart the service while it is still in a DRAINED state in HAProxy, checking objecter_requests yields 0 even a few minutes after it has come back up.
> >
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> > > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > > 0
> >
> > Any thoughts on why these ops appear to be stuck/recurring until the daemon is restarted? I think this is related to our performance issues, but I don't know what the fix is.
> >
> > The rgws are 18.2.4, running as containers in Podman on Debian 11. Our other clusters do not exhibit this behavior.
> >
> > Thanks!
> >
> > Reid
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io