Hi, Thanks for pointing that out Yuqi. I'm going to work on merging 20158 and 20164 next.
Since there are no objections I am going to assume lazy consensus and merge back to 5.0. Ariel On Thu, Feb 27, 2025, at 2:44 PM, Yuqi Yan wrote: > Thanks Ariel for bringing this to the dev mail list. > I want to add few more notes for 20158 and 20164: > > > The only problem with this approach is that the tree is not rebalanced > Actually 20158 won't build an imbalanced tree - to replace an interval in the > tree, the before and after must have the same intervals [low, high]. > For such a case, even if you rebuild the tree, the new interval will just be > placed at the exact same node in the tree, which makes it very unnecessary to > rebuild the entire tree. > > By default C* is running with preemptive open feature (50MB size), and with > default 160MB setup for LCS, 20158 can save you at least 50% time & CPU from > building the interval trees per compaction task (you can check the test > result from in the ticket) > > > force a full rebuild periodically or based on some balance signal > Agree that I think this can be achieved by some checking on last update time, > and fallback to rebuild method when not being rebuilt say after 5 minutes or > so, as safeguard. > > > I want to point out that 20164 (or any other better idea) is still needed to > *prevent the memtable flushing from being choked *by this low interval tree > build throughput. > I attached the test result in CASSANDRA-20159 > <https://issues.apache.org/jira/browse/CASSANDRA-20159>, that even with > global improvement for interval tree build from 19596, memtable flushing can > still take more than minutes to complete. > Memtable flushing is *single-threaded* that needs to compete with other view > updates (compactions, index summary rebuild, etc.). Not all view updates > require an interval tree rebuild. Supporting addInterval will make this > update way faster (several orders of magnitudes) than at least compaction ops > (checkpoint()), which is the most common type of view updates you can see > from a normal node. With this you'll see way lower contentions for view > updates from memtable flushing and prevent the node from being stuck. > > Let me know if I can do something to improve the patches / make the review > process smoother :) > > > On Thu, Feb 27, 2025 at 10:20 AM Jon Haddad <j...@rustyrazorblade.com> wrote: >> I’ve encountered a handful of spinning platters, but not a lot. >> >> I think we should generally optimize for the common case, not the exception. >> >> >> On Thu, Feb 27, 2025 at 9:51 AM Josh McKenzie <jmcken...@apache.org> wrote: >>> __ >>> This is a significant enough performance problem *in normal operations* I'd >>> consider it a bug and thus eligible for back-porting. A couple other >>> thoughts: >>> >>>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes >>>> faster ... The only problem with this approach is that the tree is not >>>> rebalanced... it would be trivial to force a full rebuild periodically or >>>> based on some balance signal. >>> I'd also be strongly in support of these modifications landing in all >>> currently supported branches if someone has the time and energy to step up >>> and do the work given the combined current test coverage and robust >>> property-based additions to the testing in 19596. >>> >>>> Early open which is enabled by default >>> Is it fair to assume most current C* nodes will no longer be on spinning >>> disk and thus we should change this default to disabled? >>> >>> On Thu, Feb 27, 2025, at 12:35 PM, Ariel Weisberg wrote: >>>> Hi, >>>> >>>> I want to discuss what versions we should backport IntervalTree >>>> improvements to specifically 19596 which I think is the lower risk option >>>> because it builds the same trees as before. I think we should at least >>>> backport to 5.0. >>>> >>>> IntervalTree performance has shown up as a problematic bottleneck in a >>>> couple of scenarios. If a node ends up with lots of small sstables it will >>>> fall over trying to build IntervalTrees because they are very slow. With >>>> 100k sstables which isn't unusual in the normal case for leveled >>>> compaction it takes 200+ milliseconds to build one on my laptop. >>>> >>>> When this happens compactions and flushing can't complete and the mutation >>>> stage backs up along with usual problems associated with a large >>>> compaction backlog. Early open which is enabled by default makes this much >>>> worse because it causes compaction to rebuild IntervalTree many times more >>>> often. >>>> >>>> The problem is the tree is immutable and currently the entire tree is >>>> rebuilt every time it is updated. The proper fix is to create a persistent >>>> tree, but no one has had the time at the moment and it's not a great >>>> candidate for back porting to existing releases. >>>> >>>> CASSANDRA-19596 builds identical trees but aims to make the building >>>> process significantly more efficient by removing a lot of different points >>>> of overhead. The original tree does a lot of redundant sorting and >>>> reallocation inside array lists. 19596 properly sizes the arrays when >>>> allocating and reuses the existing sorted order to do a linear merge of >>>> additions and removals. This is 2x faster to build a tree from scratch and >>>> 4x faster to update. >>>> >>>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes >>>> faster for some common operations (replace interval, add interval) and >>>> work by treating the tree as a persistent tree and only rebuilding the >>>> changed nodes. The only problem with this approach is that the tree is not >>>> rebalanced, but in practice they still get rebuilt pretty often and it >>>> would be trivial to force a full rebuild periodically or based on some >>>> balance signal. >>>> >>>> Ariel >>>> >>>