Re: Merging compaction improvements to 5.0

Paulo Motta Wed, 12 Feb 2025 17:36:06 -0800

Thanks Jon for the additional feedback. I will take a look at the
ticket more closely and try to reproduce the claimed improvements on
my laptop.

If there's no regression in performance, I'm +1 on including this
improvement in 5.0.

On Wed, Feb 12, 2025 at 7:28 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hey Paulo,
>
> Great questions.  I've tested the patch fairly extensively across a wide
> variety of AWS instance types, both EBS-backed and not.  I believe Dave
> Capwell tested it using infra he had available.
>
> In every case I've looked at, it's been a win, or on NVMe barely a change.  
> The reason for this is that we're fetching data from a local byte array 
> instead of the page cache.  There's no circumstance where it can be faster to 
> get data out of page cache than sequentially fetching bytes out of a byte 
> array.
>
> In the ticket I've provided extensive documentation showing how to
> reproduce the results using easy-cass-lab and easy-cass-stress.  I've shown
> how to watch the filesystem and block device for individual reads
> (xfsslower & biosnoop); you can see each filesystem access, how many bytes
> were fetched, and how long it took.  I've included what I think is a fairly
> comprehensive analysis of the effects of the patch.  I controlled for
> differences in instance types by switching only the C* version, from stock
> 5.0 to the CASSANDRA-15452 patched build.
>
> I've tried to use this as an opportunity to demonstrate what I think is the 
> level of detail that a patch like this should have, so I hope you get a 
> chance to take the time to check out the JIRA.  There's 100x more detail 
> there than I've provided in this email.
>
> Jon
>
>
> On Wed, Feb 12, 2025 at 2:10 PM Paulo Motta <pa...@apache.org> wrote:
>>
>> I'm looking forward to these improvements; compaction needs TLC. :-)
>> A couple of questions:
>>
>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
>> only concern is whether this is an optimization for EBS that could be a
>> deoptimization for other environments.
>>
>> Are there reproducible scripts that anyone can run to verify the
>> improvements in their own environments? This could help alleviate any
>> concerns and build confidence about introducing a perf. improvement in a
>> patch release.
>>
>> I have not read the ticket in detail, so apologies if this was already
>> discussed there or elsewhere.
>>
>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> >
>> > Hey folks,
>> >
>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452 [1].
>> > The TL;DR is that we're internalizing a read-ahead buffer so that we
>> > make fewer requests to disk during compaction and range reads.  This
>> > results in far fewer system calls (roughly a 16x reduction) and, on
>> > systems with higher read latency, a significant improvement in
>> > compaction throughput.  We've tested several different EBS
>> > configurations and found it delivers up to a 10x improvement when read
>> > ahead is optimized to minimize read latency.  I worked with AWS and the
>> > EBS team directly on this and on the Best Practices for C* on EBS [2] I
>> > wrote for them.  I've performance tested this patch extensively, with
>> > hundreds of billions of operations across several clusters and
>> > thousands of compactions.  It has less of an impact on local NVMe,
>> > since the p99 latency there is already 10-30x lower than what you see
>> > on EBS (100 microseconds vs 1-3 ms), and you can do hundreds of
>> > thousands of IOPS vs a max of 16K.
>> >
>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which significantly 
>> > improves compaction by avoiding reading the partition index.  
>> > CASSANDRA-20092 has been merged to trunk already [4].
>> >
>> > I think we should merge both of these patches into 5.0, as the perf
>> > improvement should allow teams to increase the density of EBS-backed C*
>> > clusters by 2-5x, driving cost way down.  There are a lot of teams
>> > running C* on EBS now.  I'm currently working with one that's
>> > bottlenecked on maxed-out EBS GP3 storage.  I propose we merge both,
>> > because without CASSANDRA-20092 we won't get the performance
>> > improvements in CASSANDRA-15452 with the BTI format, only with BIG.
>> > I've tested BTI in other situations and found it to be far more
>> > performant than BIG.
>> >
>> > If we were looking at a small win, I wouldn't care much, but since these
>> > patches, combined with UCS, allow more teams to run C* on EBS at > 10TB /
>> > node, I think it's worth doing now.
>> >
>> > Thanks in advance,
>> > Jon
>> >
>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>> > [2] https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>> > [4] https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>> >
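
The thread describes serving sequential reads from a locally held buffer so that far fewer requests reach the disk. The following is a minimal sketch of that general idea only, not the CASSANDRA-15452 implementation: the class names (CountingSource, ReadAheadReader), the 16 KiB chunk size, and the 256 KiB buffer size are all assumptions chosen for illustration. It counts how many requests reach a simulated file when it is scanned chunk by chunk, with and without the buffer.

```python
class CountingSource:
    """Hypothetical stand-in for a file: counts the read requests it receives."""

    def __init__(self, data: bytes):
        self._data = data
        self.reads = 0

    def read_at(self, offset: int, length: int) -> bytes:
        self.reads += 1
        return self._data[offset:offset + length]


class ReadAheadReader:
    """Serves small sequential reads out of one larger, locally held buffer."""

    def __init__(self, source: CountingSource, buffer_size: int):
        self._source = source
        self._buffer_size = buffer_size
        self._buf = b""
        self._buf_start = 0  # file offset of the first byte in self._buf

    def read_at(self, offset: int, length: int) -> bytes:
        buf_end = self._buf_start + len(self._buf)
        in_buffer = self._buf_start <= offset and offset + length <= buf_end
        if not in_buffer:
            # Refill with one large request instead of many small ones.
            size = max(self._buffer_size, length)
            self._buf = self._source.read_at(offset, size)
            self._buf_start = offset
        rel = offset - self._buf_start
        return self._buf[rel:rel + length]


def scan(reader, total: int, chunk: int) -> bytes:
    """Read the whole file sequentially, one chunk at a time."""
    out = bytearray()
    for offset in range(0, total, chunk):
        out += reader.read_at(offset, min(chunk, total - offset))
    return bytes(out)


CHUNK = 16 * 1024     # assumed per-read size, illustrative only
BUFFER = 256 * 1024   # assumed read-ahead buffer size, illustrative only
data = bytes(range(256)) * 4096  # 1 MiB of sample data

direct = CountingSource(data)
direct_result = scan(direct, len(data), CHUNK)

buffered_source = CountingSource(data)
buffered_result = scan(ReadAheadReader(buffered_source, BUFFER), len(data), CHUNK)

print("requests without buffer:", direct.reads)
print("requests with buffer:   ", buffered_source.reads)
```

With these assumed sizes the reduction in requests is simply BUFFER / CHUNK; the real ratio depends on the sizes the patch and the workload actually use, and the latency effect on real storage has to be measured as described in the ticket.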
