Hi Josh, thanks! I have one more question.

I am trying to reproduce our OSD degradation caused by the massive lifecycle deletions, and as a next step I will try enabling rocksdb_cf_compact_on_deletion.
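For reference, this is roughly how I plan to turn it on. The option names and the restart step are my assumptions based on the thread you linked further down, and the window/trigger numbers are just placeholders to tune, not recommendations:

  ceph config set osd rocksdb_cf_compact_on_deletion true
  ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768
  ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384
  # then restart the OSDs, since I assume the rocksdb options are only re-read when the DB is opened

If I have the names or the restart requirement wrong, please correct me.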
But there is one thing I don't understand. Okay, the default auto-compaction can't detect growing tombstones, but size-based compaction does take place, as far as I can see. For example, on a degraded OSD (256 Ops for a few hours) I can see that some compactions do run:

grep -E "Compaction start" /var/log/ceph/ceph-osd.75.log
2024-07-18T14:15:06.759+0300 7f6e67047700 4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1410 Base level 0, inputs: [366238(51MB) 366236(52MB) 366234(51MB) 366232(47MB)], [366199(67MB) 366200(67MB) 366201(67MB) 366202(11MB)]
2024-07-18T14:15:09.076+0300 7f6e67047700 4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1411 Base level 1, inputs: [366244(66MB)], [366204(66MB) 366205(67MB) 366206(66MB) 366207(66MB) 366208(32MB) 366209(67MB) 366210(66MB)]
2024-07-18T14:15:12.054+0300 7f6e67047700 4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1412 Base level 1, inputs: [366240(67MB)], [366154(55MB) 366138(67MB) 366139(67MB) 366140(67MB)]

grep -E "compaction_started|compaction_finished" /var/log/ceph/ceph-osd.75.log
2024-07-18T14:15:21.094+0300 7f6e67047700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1721301321095580, "job": 2284, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L3": [366275], "files_L4": [366230, 364501], "score": 1.00751, "input_data_size": 89528358}
2024-07-18T14:15:21.727+0300 7f6e67047700 4 rocksdb: (Original Log Time 2024/07/18-14:15:21.728545) EVENT_LOG_v1 {"time_micros": 1721301321728537, "job": 2284, "event": "compaction_finished", "compaction_time_micros": 627038, "compaction_time_cpu_micros": 440211, "output_level": 4, "num_output_files": 1, "total_output_size": 66142074, "num_input_records": 1642765, "num_output_records": 1057434, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 4, 33, 173, 1004, 0, 0]}

So my question is: we do have regular compaction that does some work. Why doesn't it help with the tombstones? Why does only offline compaction help in our case?
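To be clear about what I mean by "offline compaction", this is roughly what we run, using osd.75 and the default data path only as an example; I assume the online variant via ceph tell performs essentially the same full compaction without stopping the OSD:

  # offline, with the OSD stopped:
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-75 compact

  # online alternative, OSD running:
  ceph tell osd.75 compact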
On 17.07.2024, 16:14, "Joshua Baergen" <jbaer...@digitalocean.com> wrote:

Hey Aleksandr,

rocksdb_delete_range_threshold has had some downsides in the past (I don't have a reference handy) so I don't recommend changing it.

> As I understand, tombstones in the case of RGW are only deletions of objects, right?

It can also happen due to bucket reshards, as this will delete the old shards after completing the reshard.

Josh

On Wed, Jul 17, 2024 at 3:23 AM Rudenko Aleksandr <arude...@croc.ru> wrote:
>
> Hi Josh,
>
> Thank you for your reply!
>
> It was helpful for me; now I understand that I can't measure rocksdb degradation using a program metric (
>
> In our version (16.2.13) we have this code (with the new option rocksdb_cf_compact_on_deletion). We will try using it. As I understand, tombstones in the case of RGW are only deletions of objects, right? Do you know other cases when tombstones are generated in the RGW scenario?
>
> We have another option in our version: rocksdb_delete_range_threshold
>
> Do you think it can be helpful?
>
> I think our problem is caused by the massive deletions generated by the lifecycle rule of a big bucket.
>
> On 16.07.2024, 19:25, "Joshua Baergen" <jbaer...@digitalocean.com> wrote:
>
> Hello Aleksandr,
>
> What you're probably experiencing is tombstone accumulation, a known issue for Ceph's use of rocksdb.
>
> > 1. Why can't automatic compaction manage this on its own?
>
> rocksdb compaction is normally triggered by level fullness and not tombstone counts. However, there is a feature in rocksdb that can cause a file to be compacted if there are many tombstones found in it when iterated, which can help immensely with tombstone accumulation problems; it is available as of 16.2.14. You can find a summary of how to enable it and tweak it here:
> https://www.spinics.net/lists/ceph-users/msg78514.html
>
> > 2. How can I see RocksDB levels usage or some program metric which can be used as a condition for manual compacting?
>
> There is no tombstone counter that I'm aware of, which is really what you need in order to trigger compaction when appropriate.
>
> Josh
>
> On Tue, Jul 16, 2024 at 9:12 AM Rudenko Aleksandr <arude...@croc.ru> wrote:
> >
> > Hi,
> >
> > We have a big Ceph cluster (RGW case) with a lot of big buckets, 10-500M objects with 31-1024 shards, and a lot of I/O generated by many clients. The index pool is placed on enterprise SSDs. We have about 120 SSDs (replication 3) and about 90 GB of OMAP data on each drive. About 75 PGs on each SSD for now. I think 75 is not enough for this amount of data, but I'm not sure that it is critical in our case.
> >
> > The problem:
> > For the last few weeks, we have seen big degradation of some SSD OSDs. We see a lot of Ops (150-256 in perf dump) for a long time, and high avg op times of 1-3 seconds (this metric is based on dump_historic_ops_by_duration and our own avg calculation). These numbers are very unusual for our deployment, and it has a big impact on our customers.
> >
> > For now, we run offline compaction of the "degraded" OSDs; after compaction, all PGs of that OSD return to it (because recovery is disabled during compaction), and the OSD works perfectly for some time. All our metrics drop back down for a few weeks, or only a few days…
> >
> > And I think the problem is in the rocksdb database, which grows by levels and slows down requests.
> >
> > I have a few questions about compaction:
> >
> > 1. Why can't automatic compaction manage this on its own?
> > 2. How can I see RocksDB levels usage or some program metric which can be used as a condition for manual compacting? Our metrics like request latency, OSD Ops count, and OSD avg slow-ops time do not map 100% to the rocksdb internal state.
> >
> > We tried to use the perf dump bluestore.kv_xxx_lat metrics, but I think they are absolutely useless, because we can see higher values after compaction and an OSD restart than these metrics showed before compaction, lol.
> >
> > We tried looking at the OSD logs with debug_rocksdb=4, but we can't make sense of this output or compare it between good and bad OSDs:
> >
> > Uptime(secs): 3610809.5 total, 600.0 interval
> > Flush(GB): cumulative 50.519, interval 0.000
> > AddFile(GB): cumulative 0.000, interval 0.000
> > AddFile(Total Files): cumulative 0, interval 0
> > AddFile(L0 Files): cumulative 0, interval 0
> > AddFile(Keys): cumulative 0, interval 0
> > Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
> > Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> >
> > ** File Read Latency Histogram By Level [p-0] **
> >
> > ** Compaction Stats [p-1] **
> > Level  Files  Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > L0     0/0    0.00 KB    0.0    0.0       0.0     0.0       52.0       52.0      0.0        1.0    0.0       155.2     343.25     176.27             746        0.460     0      0
> > L1     4/0    204.00 MB  0.8    96.7      52.0    44.7      96.2       51.4      0.0        1.8    164.6     163.6     601.87     316.34             237        2.540     275M   1237K
> > L2     34/0   2.09 GB    1.0    176.8     43.6    133.2     175.3      42.1      7.6        4.0    147.6     146.3     1226.72    590.20             535        2.293     519M   2672K
> > L3     79/0   4.76 GB    0.3    163.6     38.9    124.7     140.7      16.1      8.8        3.6    152.6     131.3     1097.53    443.60             432        2.541     401M   79M
> > L4     316/0  19.96 GB   0.1    0.9       0.4     0.5       0.8        0.3       19.7       2.0    151.1     131.1     5.94       2.53               4          1.485     2349K  385K
> > Sum    433/0  27.01 GB   0.0    438.0     134.9   303.0     465.0      161.9     36.1       8.9    136.9     145.4     3275.31    1528.94            1954       1.676     1198M  83M
> > Int    0/0    0.00 KB    0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0
> >
> > and this log is not usable as a "metric".
> >
> > Our Ceph version is 16.2.13 and we use the default bluestore_rocksdb_options.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io