[ https://issues.apache.org/jira/browse/CASSANDRA-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ariel Weisberg updated CASSANDRA-20158: --------------------------------------- Resolution: Duplicate Status: Resolved (was: Open) > IntervalTree should support copyAndReplace for checkpoint when ranges are > unchanged > ----------------------------------------------------------------------------------- > > Key: CASSANDRA-20158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-20158 > Project: Apache Cassandra > Issue Type: Improvement > Components: Local/Compaction, Local/SSTable > Reporter: Yuqi Yan > Assignee: Yuqi Yan > Priority: Normal > Fix For: 4.1.x > > Attachments: image-2024-12-20-02-39-53-420.png, > image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > We observed very slow compaction and sometimes stuck memtable flushing hence > caused write latency spikes when the cluster has large number of SSTables > (~20K), similar to what was observed in CASSANDRA-19596. > Looking deeper into when these interval tree is rebuilt - there is actually > no need to do rebuild all the time for checkpoint() calls. > > updateLiveSet(toUpdate, staged.update) this is updating the current version > of SSTableReader within the View. However the update isn't always changing > the ranges of the SSTableReader (low, high). > > One Example: > * IndexSummaryRedistribution.adjustSamplingLevels() > ** SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs, > entry.newSamplingLevel); > ** This is changing the Metadata only and the ranges are unchanged > > Considering this, rebuilding the entire IntervalTree will not be required, > instead IntervalTree should support replacing these SSTableReader. > > If we're rebuilding the tree, complexity is O(n(logn)^2) in current trunk as > we're repeating the O(nlogn) sort on every node creation, after > CASSANDRA-19596 this will be O(nlogn), but with update supported, some of the > updateLiveSet calls can be optimized to O(m(logn)^2) where m is the number of > SSTableReaders we attempt to replace, which we have m << n (number of > SSTables) in most cases. > > This is achieved by > # finding the node containing the SSTable (logn) > # binary search and replacing the SSTableReader from the node (logn) > # To support CAS update, 1 and 2 need to done by copying the path and > re-create the affected nodes on the path > > The experiment I did was on a 2 rings setup, one on 4.1+CASSANDRA-19596 > (marked as trunk), and the other on 4.1+CASSANDRA-19596+this patch (marked > as new). ~15K SSTables (with LCS, sstable size was 50MB, single_uplevel > enabled). stress-test with 1:1 rw ratio. > Result shows that ~15% of the checkpoint calls don't necessarily need to > rebuild the tree. > !image-2024-12-20-02-41-06-544.png|width=803,height=133! > Compaction throughput (scanner read throughput) was increased from 130MB/s to > 200MB/s > !image-2024-12-20-02-39-53-420.png|width=1378,height=136! > Checkpoint finish time reduced from mean ~1.5s to ~800ms > !image-2024-12-20-02-42-52-003.png|width=1100,height=186! -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org