Hi Josh, thanks!

I have one more question. I am trying to reproduce our OSD degradation caused by 
massive lifecycle deletions, and as a next step I will try enabling 
rocksdb_cf_compact_on_deletion.
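For reference, this is roughly how I plan to enable it. The first option is the one 
we already have in our 16.2.13 build; I am assuming the companion sliding-window and 
trigger options exist under these names as well, and the values below are only 
placeholders we still need to tune (e.g. following the thread Josh linked):

ceph config set osd rocksdb_cf_compact_on_deletion true
# keys examined per sliding window during iteration (placeholder value)
ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768
# tombstones within the window that trigger compaction of that file (placeholder value)
ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384
# then restart the OSDs, since I assume these rocksdb options only take effect
# when the DB is (re)opened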

But one thing I don't understand. Okay, the default auto-compaction can't detect the 
growing number of tombstones, but regular compaction based on file size does take 
place, as far as I can see.

For example, we have a degraded OSD (256 ops in perf dump for a few hours).

And I can see that some compactions take place:

grep -E "Compaction start" /var/log/ceph/ceph-osd.75.log

2024-07-18T14:15:06.759+0300 7f6e67047700  4 rocksdb: 
[compaction/compaction_job.cc:1680] [default] Compaction start summary: Base 
version 1410 Base level 0, inputs: [366238(51MB) 366236(52MB) 366234(51MB) 
366232(47MB)], [366199(67MB) 366200(67MB) 366201(67MB) 366202(11MB)]
2024-07-18T14:15:09.076+0300 7f6e67047700  4 rocksdb: 
[compaction/compaction_job.cc:1680] [default] Compaction start summary: Base 
version 1411 Base level 1, inputs: [366244(66MB)], [366204(66MB) 366205(67MB) 
366206(66MB) 366207(66MB) 366208(32MB) 366209(67MB) 366210(66MB)]
2024-07-18T14:15:12.054+0300 7f6e67047700  4 rocksdb: 
[compaction/compaction_job.cc:1680] [default] Compaction start summary: Base 
version 1412 Base level 1, inputs: [366240(67MB)], [366154(55MB) 366138(67MB) 
366139(67MB) 366140(67MB)]

grep -E "compaction_started|compaction_finished" /var/log/ceph/ceph-osd.75.log

2024-07-18T14:15:21.094+0300 7f6e67047700  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1721301321095580, "job": 2284, "event": "compaction_started", 
"compaction_reason": "LevelMaxLevelSize", "files_L3": [366275], "files_L4": 
[366230, 364501], "score": 1.00751, "input_data_size": 89528358}
2024-07-18T14:15:21.727+0300 7f6e67047700  4 rocksdb: (Original Log Time 
2024/07/18-14:15:21.728545) EVENT_LOG_v1 {"time_micros": 1721301321728537, 
"job": 2284, "event": "compaction_finished", "compaction_time_micros": 627038, 
"compaction_time_cpu_micros": 440211, "output_level": 4, "num_output_files": 1, 
"total_output_size": 66142074, "num_input_records": 1642765, 
"num_output_records": 1057434, "num_subcompactions": 1, "output_compression": 
"NoCompression", "num_single_delete_mismatches": 0, 
"num_single_delete_fallthrough": 0, "lsm_state": [0, 4, 33, 173, 1004, 0, 0]}

And my question is: we have regular compaction that does some work. Why doesn't it 
help with tombstones?
Why does only offline compaction help in our case?
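(For clarity, by offline compaction I mean roughly the following procedure, run with 
the OSD stopped and recovery disabled as I described earlier; osd.75 and the path 
are just examples from this host:)

systemctl stop ceph-osd@75
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-75 compact
systemctl start ceph-osd@75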


On 17.07.2024, 16:14, "Joshua Baergen" <jbaer...@digitalocean.com> wrote:


Hey Aleksandr,


rocksdb_delete_range_threshold has had some downsides in the past (I
don't have a reference handy) so I don't recommend changing it.


> As I understand, in the case of RGW tombstones come only from object
> deletions, right?


It can also happen due to bucket reshards, as this will delete the old
shards after completing the reshard.


Josh


On Wed, Jul 17, 2024 at 3:23 AM Rudenko Aleksandr <arude...@croc.ru> wrote:
>
> Hi Josh,
>
> Thank you for your reply!
>
> It was helpful for me; now I understand that I can't measure rocksdb 
> degradation using a program metric, unfortunately.
>
> In our version (16.2.13) we have this code (with the new option 
> rocksdb_cf_compact_on_deletion). We will try using it. As I understand, in the 
> case of RGW tombstones come only from object deletions, right? Do you 
> know other cases where tombstones are generated in an RGW scenario?
>
> We have another option in our version: rocksdb_delete_range_threshold
>
> Do you think it can be helpful?
>
> I think our problem arises from the massive deletions generated by the 
> lifecycle rule of a big bucket.
> On 16.07.2024, 19:25, "Joshua Baergen" <jbaer...@digitalocean.com> wrote:
>
>
> Hello Aleksandr,
>
>
> What you're probably experiencing is tombstone accumulation, a known
> issue for Ceph's use of rocksdb.
>
>
> > 1. Why can't automatic compaction manage this on its own?
>
>
> rocksdb compaction is normally triggered by level fullness and not
> tombstone counts. However, there is a feature in rocksdb that can
> cause a file to be compacted if many tombstones are found in it
> during iteration, which can help immensely with tombstone accumulation
> problems; it is available as of 16.2.14. You can find a summary of how to
> enable it and tweak it here:
> https://www.spinics.net/lists/ceph-users/msg78514.html
>
>
> > 2. How can I see RocksDB levels usage or some program metric which can be 
> > used as a condition for manual compacting?
>
>
> There is no tombstone counter that I'm aware of, which is really what
> you need in order to trigger compaction when appropriate.
>
>
> Josh
>
>
> On Tue, Jul 16, 2024 at 9:12 AM Rudenko Aleksandr <arude...@croc.ru> wrote:
> >
> > Hi,
> >
> > We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M 
> > objects each, 31-1024 shards) and a lot of I/O generated by many clients.
> > The index pool is placed on enterprise SSDs. We have about 120 SSDs 
> > (replication 3) and about 90 GB of OMAP data on each drive.
> > About 75 PGs on each SSD for now. I think 75 is not enough for this amount 
> > of data, but I’m not sure that it is critical in our case.
> >
> > The problem:
> > For the last few weeks, we have seen big degradation of some SSD OSDs. We see 
> > a lot of ops (150-256 in perf dump) for a long time, and a high avg op time of 
> > 1-3 seconds (this metric is based on dump_historic_ops_by_duration and our own 
> > averaging). These numbers are very unusual for our deployment, and this has a 
> > big impact on our customers.
> >
> > For now, we run offline compaction of ‘degraded’ OSDs. After compaction, we 
> > see that all PGs of the OSD return to it (because recovery is disabled during 
> > compaction), and the OSD works perfectly for some time; all our metrics drop 
> > back down for a few weeks, or sometimes only a few days…
> >
> > And I think the problem is in the rocksdb database, which grows across levels 
> > and slows down requests.
> >
> > And I have a few questions about compaction:
> >
> > 1. Why can't automatic compaction manage this on its own?
> > 2. How can I see RocksDB levels usage or some program metric which can be 
> > used as a condition for manual compacting? Our metrics, like request latency, 
> > OSD ops count, and OSD avg slow ops time, don't correlate 100% with rocksdb's 
> > internal state.
> >
> > We tried using perf dump and the bluestore.kv_xxx_lat metrics, but I think 
> > they are absolutely useless, because we can see higher values after compaction 
> > and an OSD restart than these metrics showed before compaction, lol.
> >
> > We also tried reading the OSD logs with debug_rocksdb=4, but we can't 
> > understand this output or compare it between good and bad OSDs:
> >
> > Uptime(secs): 3610809.5 total, 600.0 interval
> > Flush(GB): cumulative 50.519, interval 0.000
> > AddFile(GB): cumulative 0.000, interval 0.000
> > AddFile(Total Files): cumulative 0, interval 0
> > AddFile(L0 Files): cumulative 0, interval 0
> > AddFile(Keys): cumulative 0, interval 0
> > Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 
> > 0.13 MB/s read, 3346.3 seconds
> > Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 
> > MB/s read, 0.0 seconds
> > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 
> > level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for 
> > pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 
> > memtable_compaction,
> > 0 memtable_slowdown, interval 0 total count
> >
> > ** File Read Latency Histogram By Level [p-0] **
> >
> > ** Compaction Stats [p-1] **
> > Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> > --------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > L0   0/0   0.00 KB   0.0 0.0   0.0   0.0   52.0  52.0  0.0  1.0 0.0   155.2 343.25  176.27  746  0.460 0     0
> > L1   4/0   204.00 MB 0.8 96.7  52.0  44.7  96.2  51.4  0.0  1.8 164.6 163.6 601.87  316.34  237  2.540 275M  1237K
> > L2   34/0  2.09 GB   1.0 176.8 43.6  133.2 175.3 42.1  7.6  4.0 147.6 146.3 1226.72 590.20  535  2.293 519M  2672K
> > L3   79/0  4.76 GB   0.3 163.6 38.9  124.7 140.7 16.1  8.8  3.6 152.6 131.3 1097.53 443.60  432  2.541 401M  79M
> > L4   316/0 19.96 GB  0.1 0.9   0.4   0.5   0.8   0.3   19.7 2.0 151.1 131.1 5.94    2.53    4    1.485 2349K 385K
> > Sum  433/0 27.01 GB  0.0 438.0 134.9 303.0 465.0 161.9 36.1 8.9 136.9 145.4 3275.31 1528.94 1954 1.676 1198M 83M
> > Int  0/0   0.00 KB   0.0 0.0   0.0   0.0   0.0   0.0   0.0  0.0 0.0   0.0   0.00    0.00    0    0.000 0     0
> >
> > and this log output is not usable as a ‘metric’.
> >
> > Our Ceph version is 16.2.13, and we use the default bluestore_rocksdb_options.
>
>
>



_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
