Florian,

Thank you for your detailed reply. I was right in thinking that the 223k+
usage log entries were causing my large omap object warning. You've also
confirmed my suspicions that osd_deep_scrub_large_omap_object_key_threshold
was changed between Ceph versions. I ended up trimming all of the usage
logs before 2019-10-01. First I exported the log -- it was 140MB!
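
For anyone wanting to keep a copy before trimming: a plain dump of the
usage log redirected to a file should do it -- something along these lines,
with whatever filename you prefer:

radosgw-admin usage show > usage-backup.json

And then the trim itself: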

radosgw-admin usage trim --start-date=2018-10-01 --end-date=2019-10-01

Interestingly enough, trimming all logs with --start-date & --end-date only
took maybe 10 seconds, but when I tried to trim the usage for only a single
user/bucket, it took over 30 minutes. Either way, after trimming the log
down considerably, I manually issued a deep scrub on pgid 5.70, after which
the cluster health returned to HEALTH_OK.
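
For reference, the manual deep scrub on that PG is just:

ceph pg deep-scrub 5.70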

I hope this can serve as a guide for anyone else who runs into this problem
:)

Thanks again,
Dave

On Tue, Oct 29, 2019 at 3:22 AM Florian Haas <flor...@citynetwork.eu> wrote:

> Hi David,
>
> On 28/10/2019 20:44, David Monschein wrote:
> > Hi All,
> >
> > Running an object storage cluster, originally deployed with Nautilus
> > 14.2.1 and now running 14.2.4.
> >
> > Last week I was alerted to a new warning from my object storage cluster:
> >
> > [root@ceph1 ~]# ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> >     1 large objects found in pool 'default.rgw.log'
> >     Search the cluster log for 'Large omap object found' for more details.
> >
> > I looked into this and found the object and pool in question
> > (default.rgw.log):
> >
> > [root@ceph1 /var/log/ceph]# grep -R -i 'Large omap object found' .
> > ./ceph.log:2019-10-24 12:21:26.984802 osd.194 (osd.194) 715 : cluster
> > [WRN] Large omap object found. Object: 5:0fbdcb32:usage::usage.17:head
> > Key count: 702330 Size (bytes): 92881228
> >
> > [root@ceph1 ~]# ceph --format=json pg ls-by-pool default.rgw.log | jq
> > '.[]' | egrep '(pgid|num_large_omap_objects)' | grep -v
> > '"num_large_omap_objects": 0,' | grep -B1 num_large_omap_objects
> >     "pgid": "5.70",
> >       "num_large_omap_objects": 1,
> > While I was investigating, I noticed an enormous amount of entries in
> > the RGW usage log:
> >
> > [root@ceph ~]# radosgw-admin usage show | grep -c bucket
> > 223326
> > [...]
>
> I recently ran into a similar issue:
>
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AQNGVY7VJ3K6ZGRSTX3E5XIY7DBNPDHW/
>
> You have 702,330 keys on that omap object, so you would have been bitten
> by the default for osd_deep_scrub_large_omap_object_key_threshold having
> been revised down from 2,000,000 to 200,000 in 14.2.3:
>
>
> https://github.com/ceph/ceph/commit/d8180c57ac9083f414a23fd393497b2784377735
> https://tracker.ceph.com/issues/40583
>
> That's why you didn't see this warning before your recent upgrade.
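>
> If you want to confirm the threshold an OSD is actually running with, the
> admin socket will show it -- for example, on the node hosting osd.194 from
> your log, something like:
>
> ceph daemon osd.194 config get osd_deep_scrub_large_omap_object_key_threshold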
>
> > There are entries for over 223k buckets! This was pretty scary to see,
> > considering we only have maybe 500 legitimate buckets in this fairly new
> > cluster. Almost all of the entries in the usage log are bogus entries
> > from anonymous users. It looks like someone/something was scanning,
> > looking for vulnerabilities, etc. Here are a few example entries, notice
> > none of the operations were successful:
>
> Caveat: whether or not you really *want* to trim the usage log is up to
> you to decide. If you suspect you are dealing with a security breach,
> you should definitely export and preserve the usage log before you trim
> it, or else delay trimming until you have properly investigated the
> problem.
>
> *If* you decide you no longer need those usage log entries, you can use
> "radosgw-admin usage trim" with appropriate --start-date, --end-date,
> and/or --uid options to clean them up:
>
> https://docs.ceph.com/docs/nautilus/radosgw/admin/#trim-usage
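>
> A targeted trim for a single user would look roughly like this (the uid
> and dates are placeholders):
>
> radosgw-admin usage trim --uid=<offending-uid> --start-date=2019-01-01 --end-date=2019-10-01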
>
> Please let me know if that information is helpful. Thank you!
>
> Cheers,
> Florian
>