Hi,

I want to discuss what versions we should backport IntervalTree improvements to 
specifically 19596 which I think is the lower risk option because it builds the 
same trees as before. I think we should at least backport to 5.0.

IntervalTree performance has shown up as a problematic bottleneck in a couple 
of scenarios. If a node ends up with lots of small sstables it will fall over 
trying to build IntervalTrees because they are very slow. With 100k sstables 
which isn't unusual in the normal case for leveled compaction it takes 200+ 
milliseconds to build one on my laptop.

When this happens compactions and flushing can't complete and the mutation 
stage backs up along with usual problems associated with a large compaction 
backlog. Early open which is enabled by default makes this much worse because 
it causes compaction to rebuild IntervalTree many times more often. 

The problem is the tree is immutable and currently the entire tree is rebuilt 
every time it is updated. The proper fix is to create a persistent tree, but no 
one has had the time at the moment and it's not a great candidate for back 
porting to existing releases.

CASSANDRA-19596 builds identical trees but aims to make the building process 
significantly more efficient by removing a lot of different points of overhead. 
The original tree does a lot of redundant sorting and reallocation inside array 
lists. 19596 properly sizes the arrays when allocating and reuses the existing 
sorted order to do a linear merge of additions and removals. This is 2x faster 
to build a tree from scratch and 4x faster to update.

CASSANDRA-20158 and CASSANDRA-20164 are several orders of magnitudes faster for 
some common operations (replace interval, add interval) and work by treating 
the tree as a persistent tree and only rebuilding the changed nodes. The only 
problem with this approach is that the tree is not rebalanced, but in practice 
they still get rebuilt pretty often and it would be trivial to force a full 
rebuild periodically or based on some balance signal.

Ariel

Reply via email to