Thanks Ariel for bringing this to the dev mail list.
I want to add few more notes for 20158 and 20164:

> The only problem with this approach is that the tree is not rebalanced
Actually 20158 won't build an imbalanced tree - to replace an interval in
the tree, the before and after must have the same intervals [low, high].
For such a case, even if you rebuild the tree, the new interval will just
be placed at the exact same node in the tree, which makes it very
unnecessary to rebuild the entire tree.

By default C* is running with preemptive open feature (50MB size), and with
default 160MB setup for LCS, 20158 can save you at least 50% time & CPU
from building the interval trees per compaction task (you can check the
test result from in the ticket)

> force a full rebuild periodically or based on some balance signal
Agree that I think this can be achieved by some checking on last update
time, and fallback to rebuild method when not being rebuilt say after 5
minutes or so, as safeguard.


I want to point out that 20164 (or any other better idea) is still needed
to *prevent the memtable flushing from being choked *by this low interval
tree build throughput.
I attached the test result in CASSANDRA-20159
<https://issues.apache.org/jira/browse/CASSANDRA-20159>, that even with
global improvement for interval tree build from 19596, memtable flushing
can still take more than minutes to complete.
Memtable flushing is *single-threaded* that needs to compete with other
view updates (compactions, index summary rebuild, etc.). Not all view
updates require an interval tree rebuild. Supporting addInterval will make
this update way faster (several orders of magnitudes) than at least
compaction ops (checkpoint()), which is the most common type of view
updates you can see from a normal node. With this you'll see way lower
contentions for view updates from memtable flushing and prevent the node
from being stuck.

Let me know if I can do something to improve the patches / make the review
process smoother :)


On Thu, Feb 27, 2025 at 10:20 AM Jon Haddad <j...@rustyrazorblade.com> wrote:

> I’ve encountered a handful of spinning platters, but not a lot.
>
> I think we should generally optimize for the common case, not the
> exception.
>
>
> On Thu, Feb 27, 2025 at 9:51 AM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>> This is a significant enough performance problem *in normal operations*
>> I'd consider it a bug and thus eligible for back-porting. A couple other
>> thoughts:
>>
>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes
>> faster ... The only problem with this approach is that the tree is not
>> rebalanced... it would be trivial to force a full rebuild periodically or
>> based on some balance signal.
>>
>> I'd also be strongly in support of these modifications landing in all
>> currently supported branches if someone has the time and energy to step up
>> and do the work given the combined current test coverage and robust
>> property-based additions to the testing in 19596.
>>
>> Early open which is enabled by default
>>
>> Is it fair to assume most current C* nodes will no longer be on spinning
>> disk and thus we should change this default to disabled?
>>
>> On Thu, Feb 27, 2025, at 12:35 PM, Ariel Weisberg wrote:
>>
>> Hi,
>>
>> I want to discuss what versions we should backport IntervalTree
>> improvements to specifically 19596 which I think is the lower risk option
>> because it builds the same trees as before. I think we should at least
>> backport to 5.0.
>>
>> IntervalTree performance has shown up as a problematic bottleneck in a
>> couple of scenarios. If a node ends up with lots of small sstables it will
>> fall over trying to build IntervalTrees because they are very slow. With
>> 100k sstables which isn't unusual in the normal case for leveled compaction
>> it takes 200+ milliseconds to build one on my laptop.
>>
>> When this happens compactions and flushing can't complete and the
>> mutation stage backs up along with usual problems associated with a large
>> compaction backlog. Early open which is enabled by default makes this much
>> worse because it causes compaction to rebuild IntervalTree many times more
>> often.
>>
>> The problem is the tree is immutable and currently the entire tree is
>> rebuilt every time it is updated. The proper fix is to create a persistent
>> tree, but no one has had the time at the moment and it's not a great
>> candidate for back porting to existing releases.
>>
>> CASSANDRA-19596 builds identical trees but aims to make the building
>> process significantly more efficient by removing a lot of different points
>> of overhead. The original tree does a lot of redundant sorting and
>> reallocation inside array lists. 19596 properly sizes the arrays when
>> allocating and reuses the existing sorted order to do a linear merge of
>> additions and removals. This is 2x faster to build a tree from scratch and
>> 4x faster to update.
>>
>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes
>> faster for some common operations (replace interval, add interval) and work
>> by treating the tree as a persistent tree and only rebuilding the changed
>> nodes. The only problem with this approach is that the tree is not
>> rebalanced, but in practice they still get rebuilt pretty often and it
>> would be trivial to force a full rebuild periodically or based on some
>> balance signal.
>>
>> Ariel
>>
>>
>>

Reply via email to