On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote: > > Hi Casey, > > 1. > Regarding versioning, the user doesn't use versioning if I'm not mistaken: > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt > > 2. > Regarding multiparts, if there were multipart trash, it would be listed > here: > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt > as rgw.multimeta under the usage, right? > > 3. > Regarding the multisite idea, this bucket was a multisite bucket last > year, but we had to reshard (accepting to lose the replica on the 2nd site > and just keep it in the master site), and at that time, as expected, it > disappeared completely from the 2nd site (I guess the 40TB of trash is still there, > but I can't really find how to clean it 🙁 ). Now it is a single-site bucket. > Also, this is the index pool; multisite logs should go to the rgw.log pool, > shouldn't they?
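[Editor's note: for context on the numbers discussed in this thread, here is a short sketch of why the bucket limit check can report "OK" while individual index shards still trigger large-omap warnings. The figures come from the stats and OSD log lines quoted below; the 200k threshold is the default value of osd_deep_scrub_large_omap_object_key_threshold and is assumed unchanged on this cluster.]

```python
# Figures from `radosgw-admin bucket limit check` and the OSD cluster log
# warnings quoted later in this thread.
num_objects = 53_619_489       # "num_objects" in the limit check
num_shards = 1999              # "num_shards" in the limit check
key_threshold = 200_000        # default osd_deep_scrub_large_omap_object_key_threshold

# The limit check only reports the *average* fill per shard:
objects_per_shard = num_objects // num_shards
print(objects_per_shard)                    # 26823, matching "objects_per_shard": 26823
print(objects_per_shard < key_threshold)    # True -> "fill_status": "OK"

# But the large-omap warnings are per rados object: individual shards can
# clump far above the average (versioned overwrites, multipart parts, and
# bilog entries all land on the shard chosen by the object name).
observed_shard_keys = {"shard .151": 236_919, "shard .726": 236_495}
print(max(observed_shard_keys.values()) > key_threshold)  # True -> Large omap warning
```

So a healthy average per-shard count and a large-omap warning on specific shards are not contradictory; the clumping explanations in the replies account for the gap.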
some replication logs are in the log pool, but the per-object logs are stored in the bucket index objects. you can inspect these with `radosgw-admin bilog list --bucket=X`. by default, that will only list --max-entries=1000; you can add --shard-id=Y to look at specific 'large omap' objects.

even if your single-site bucket doesn't exist on the secondary zone, changes on the primary zone are probably still generating these bilog entries. you would need to do something like `radosgw-admin bucket sync disable --bucket=X` to make it stop.

because you don't expect these changes to replicate, it's safe to delete any of this bucket's bilog entries with `radosgw-admin bilog trim --end-marker 9 --bucket=X`. depending on ceph version, you may need to run this trim command in a loop until the `bilog list` output is empty.

radosgw does eventually trim bilogs in the background after they're processed, but the secondary zone isn't processing them in this case

> > Thank you > > > ________________________________ > From: Casey Bodley <cbod...@redhat.com> > Sent: Tuesday, July 9, 2024 10:39 PM > To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> > Cc: Eugen Block <ebl...@nde.ag>; ceph-users@ceph.io <ceph-users@ceph.io> > Subject: Re: [ceph-users] Re: Large omap in index pool even if properly > sharded and not "OVER" > > Email received from the internet. If in doubt, don't click any link nor open > any attachment ! > ________________________________ > > in general, these omap entries should be evenly spread over the > bucket's index shard objects. but there are two features that may > cause entries to clump on a single shard: > > 1. for versioned buckets, multiple versions of the same object name > map to the same index shard. this can become an issue if an > application is repeatedly overwriting an object without cleaning up > old versions. lifecycle rules can help to manage these noncurrent > versions > > 2. 
during a multipart upload, all of the parts are tracked on the same > index shard as the final object name. if applications are leaving a > lot of incomplete multipart uploads behind (especially if they target > the same object name) this can lead to similar clumping. the S3 api > has operations to list and abort incomplete multipart uploads, along > with lifecycle rules to automate their cleanup > > separately, multisite clusters use these same index shards to store > replication logs. if sync gets far enough behind, these log entries > can also lead to large omap warnings > > On Tue, Jul 9, 2024 at 10:25 AM Szabo, Istvan (Agoda) > <istvan.sz...@agoda.com> wrote: > > > > It's the same bucket: > > https://gist.github.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d > > > > > > ________________________________ > > From: Eugen Block <ebl...@nde.ag> > > Sent: Tuesday, July 9, 2024 8:03 PM > > To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> > > Cc: ceph-users@ceph.io <ceph-users@ceph.io> > > Subject: Re: [ceph-users] Re: Large omap in index pool even if properly > > sharded and not "OVER" > > > > Are those three different buckets? Could you share the stats for each of > > them? 
> > > > radosgw-admin bucket stats --bucket=<BUCKET> > > > > Zitat von "Szabo, Istvan (Agoda)" <istvan.sz...@agoda.com>: > > > > > Hello, > > > > > > Yeah, still: > > > > > > the .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l > > > 290005 > > > > > > and the > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l > > > 289378 > > > > > > And to make me even happier, I have one more: > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l > > > 181588 > > > > > > This is my crush tree (I'm using a host-based crush rule): > > > https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt > > > > > > I'm wondering whether the issue could be that hosts 2s13-15 have fewer nvme > > > osds than the others (though size-wise the same as the other 12 hosts, which > > > have 8x nvme osds each). > > > But the pgs are located like this: > > > > > > pg26.427 > > > osd.261 host8 > > > osd.488 host13 > > > osd.276 host4 > > > > > > pg26.606 > > > osd.443 host12 > > > osd.197 host8 > > > osd.524 host14 > > > > > > pg26.78c > > > osd.89 host7 > > > osd.406 host11 > > > osd.254 host6 > > > > > > If pg26.78c weren't there, I'd say the nvme osd distribution > > > across hosts is 100% the issue; however, this pg is not located on any of > > > the 4x nvme osd nodes 😕 > > > > > > Ty > > > > > > ________________________________ > > > From: Eugen Block <ebl...@nde.ag> > > > Sent: Tuesday, July 9, 2024 6:02 PM > > > To: ceph-users@ceph.io <ceph-users@ceph.io> > > > Subject: [ceph-users] Re: Large omap in index pool even if properly > > > sharded and not "OVER" > > > > > > Hi, > > > > > > the number of shards looks fine, maybe this was just a temporary > > > burst? 
Did you check if the rados objects in the index pool still have > > > more than 200k omap keys? I would try something like > > > > > > rados -p <index_pool> listomapkeys > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l > > > > > > Zitat von "Szabo, Istvan (Agoda)" <istvan.sz...@agoda.com>: > > > > > >> Hi, > > >> > > >> I have a pretty big bucket which is sharded with 1999 shards, so in > > >> theory it can hold close to 200m objects (199,900,000). > > >> Currently it has 54m objects. > > >> > > >> The bucket limit check also looks good: > > >> "bucket": "xyz", > > >> "tenant": "", > > >> "num_objects": 53619489, > > >> "num_shards": 1999, > > >> "objects_per_shard": 26823, > > >> "fill_status": "OK" > > >> > > >> This is the bucket id: > > >> "id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1" > > >> > > >> These are the log lines: > > >> 2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster > > >> [WRN] Large omap object found. Object: > > >> 26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head > > >> PG: 26.3a67cc27 (26.427) Key count: 236919 Size > > >> (bytes): > > >> 89969920 > > >> > > >> 2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster [WRN] > > >> Large omap object found. Object: > > >> 26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head > > >> PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size > > >> (bytes): > > >> 95560458 > > >> > > >> I tried to deep-scrub the affected pgs and the > > >> osds mentioned in the log, but it didn't help. > > >> Why? What am I missing? > > >> > > >> Thank you in advance for your help. > > >> > > >> ________________________________ > > >> This message is confidential and is for the sole use of the intended > > >> recipient(s). It may also be privileged or otherwise protected by > > >> copyright or other legal rules. 
If you have received it by mistake > > >> please let us know by reply email and delete it from your system. It > > >> is prohibited to copy this message or disclose its content to > > >> anyone. Any confidentiality or privilege is not waived or lost by > > >> any mistaken delivery or unauthorized disclosure of the message. All > > >> messages sent to and from Agoda may be monitored to ensure > > >> compliance with company policies, to protect the company's interests > > >> and to remove potential malware. Electronic messages may be > > >> intercepted, amended, lost or deleted, or contain viruses. > > >> _______________________________________________ > > >> ceph-users mailing list -- ceph-users@ceph.io > > >> To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
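[Editor's note: the clumping mechanism described in Casey's replies above — every version of one object name, and all parts of one multipart upload, landing on the same index shard — can be illustrated with a stand-in hash. RGW's real shard placement uses its own name hash, not md5, so this sketch only demonstrates the property, not the actual placement.]

```python
import hashlib

NUM_SHARDS = 1999  # shard count of the bucket discussed in this thread

def shard_for(object_name: str) -> int:
    # Stand-in for RGW's name-based shard placement (RGW does not use md5);
    # the illustrated property is the same: the shard depends only on the
    # object name, not on the version or part being written.
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# 10k overwrites of one versioned object: every index entry lands on one shard.
version_entries = [("reports/latest.csv", f"v{i}") for i in range(10_000)]
shards_used = {shard_for(name) for name, _version in version_entries}
print(len(shards_used))  # 1

# 10k distinct object names: entries spread across most of the 1999 shards.
spread = {shard_for(f"obj-{i}") for i in range(10_000)}
print(len(spread) > 1000)  # True
```

This is why the limit check's per-shard average can look healthy while one shard accumulates hundreds of thousands of keys: the workload, not the shard count, determines the worst-case shard.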