Hi,

Thanks for pointing that out Yuqi. I'm going to work on merging 20158 and 20164 
next. 

Since there are no objections I am going to assume lazy consensus and merge 
back to 5.0.

Ariel

On Thu, Feb 27, 2025, at 2:44 PM, Yuqi Yan wrote:
> Thanks Ariel for bringing this to the dev mail list. 
> I want to add few more notes for 20158 and 20164:
> 
> > The only problem with this approach is that the tree is not rebalanced
> Actually 20158 won't build an imbalanced tree - to replace an interval in the 
> tree, the before and after must have the same intervals [low, high]. 
> For such a case, even if you rebuild the tree, the new interval will just be 
> placed at the exact same node in the tree, which makes it very unnecessary to 
> rebuild the entire tree.
> 
> By default C* is running with preemptive open feature (50MB size), and with 
> default 160MB setup for LCS, 20158 can save you at least 50% time & CPU from 
> building the interval trees per compaction task (you can check the test 
> result from in the ticket)
> 
> > force a full rebuild periodically or based on some balance signal
> Agree that I think this can be achieved by some checking on last update time, 
> and fallback to rebuild method when not being rebuilt say after 5 minutes or 
> so, as safeguard.
> 
> 
> I want to point out that 20164 (or any other better idea) is still needed to 
> *prevent the memtable flushing from being choked *by this low interval tree 
> build throughput.
> I attached the test result in CASSANDRA-20159 
> <https://issues.apache.org/jira/browse/CASSANDRA-20159>, that even with 
> global improvement for interval tree build from 19596, memtable flushing can 
> still take more than minutes to complete.
> Memtable flushing is *single-threaded* that needs to compete with other view 
> updates (compactions, index summary rebuild, etc.). Not all view updates 
> require an interval tree rebuild. Supporting addInterval will make this 
> update way faster (several orders of magnitudes) than at least compaction ops 
> (checkpoint()), which is the most common type of view updates you can see 
> from a normal node. With this you'll see way lower contentions for view 
> updates from memtable flushing and prevent the node from being stuck.
> 
> Let me know if I can do something to improve the patches / make the review 
> process smoother :)
> 
> 
> On Thu, Feb 27, 2025 at 10:20 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> I’ve encountered a handful of spinning platters, but not a lot. 
>> 
>> I think we should generally optimize for the common case, not the exception. 
>> 
>> 
>> On Thu, Feb 27, 2025 at 9:51 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> __
>>> This is a significant enough performance problem *in normal operations* I'd 
>>> consider it a bug and thus eligible for back-porting. A couple other 
>>> thoughts:
>>> 
>>>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes 
>>>> faster ... The only problem with this approach is that the tree is not 
>>>> rebalanced... it would be trivial to force a full rebuild periodically or 
>>>> based on some balance signal.
>>> I'd also be strongly in support of these modifications landing in all 
>>> currently supported branches if someone has the time and energy to step up 
>>> and do the work given the combined current test coverage and robust 
>>> property-based additions to the testing in 19596.
>>> 
>>>> Early open which is enabled by default
>>> Is it fair to assume most current C* nodes will no longer be on spinning 
>>> disk and thus we should change this default to disabled?
>>> 
>>> On Thu, Feb 27, 2025, at 12:35 PM, Ariel Weisberg wrote:
>>>> Hi,
>>>> 
>>>> I want to discuss what versions we should backport IntervalTree 
>>>> improvements to specifically 19596 which I think is the lower risk option 
>>>> because it builds the same trees as before. I think we should at least 
>>>> backport to 5.0.
>>>> 
>>>> IntervalTree performance has shown up as a problematic bottleneck in a 
>>>> couple of scenarios. If a node ends up with lots of small sstables it will 
>>>> fall over trying to build IntervalTrees because they are very slow. With 
>>>> 100k sstables which isn't unusual in the normal case for leveled 
>>>> compaction it takes 200+ milliseconds to build one on my laptop.
>>>> 
>>>> When this happens compactions and flushing can't complete and the mutation 
>>>> stage backs up along with usual problems associated with a large 
>>>> compaction backlog. Early open which is enabled by default makes this much 
>>>> worse because it causes compaction to rebuild IntervalTree many times more 
>>>> often. 
>>>> 
>>>> The problem is the tree is immutable and currently the entire tree is 
>>>> rebuilt every time it is updated. The proper fix is to create a persistent 
>>>> tree, but no one has had the time at the moment and it's not a great 
>>>> candidate for back porting to existing releases.
>>>> 
>>>> CASSANDRA-19596 builds identical trees but aims to make the building 
>>>> process significantly more efficient by removing a lot of different points 
>>>> of overhead. The original tree does a lot of redundant sorting and 
>>>> reallocation inside array lists. 19596 properly sizes the arrays when 
>>>> allocating and reuses the existing sorted order to do a linear merge of 
>>>> additions and removals. This is 2x faster to build a tree from scratch and 
>>>> 4x faster to update.
>>>> 
>>>> CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes 
>>>> faster for some common operations (replace interval, add interval) and 
>>>> work by treating the tree as a persistent tree and only rebuilding the 
>>>> changed nodes. The only problem with this approach is that the tree is not 
>>>> rebalanced, but in practice they still get rebuilt pretty often and it 
>>>> would be trivial to force a full rebuild periodically or based on some 
>>>> balance signal.
>>>> 
>>>> Ariel
>>>> 
>>> 

Reply via email to