I’ve been following this for a while, and I think it’s just solid engineering based on real-world challenges — probably one of the best types of contributions to have. I’m +1 on adding it to 5.0.
Patrick

On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
> +1 (nb) from my side. I raised a few comments for CASSANDRA-15452 some
> time ago and Jordan addressed them.
> I have also backported the CASSANDRA-15452 changes to my internal 4.1 fork
> and got about a 15% reduction in compaction time, even for a node with a
> local SSD.
>
> On Thu, 13 Feb 2025 at 13:22, Jordan West <jw...@apache.org> wrote:
>> For 15452 that’s correct (and I believe also for 20092). For 15452, the
>> trunk and 5.0 patches are basically identical.
>>
>> Jordan
>>
>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas <sc...@paradoxica.net> wrote:
>>> Checking to confirm the specific patches proposed for backport – is it
>>> the trunk commit for C-20092 and the open GitHub PR against the 5.0
>>> branch for C-15452 linked below?
>>>
>>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction
>>> (committed to trunk)
>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>
>>> CASSANDRA-15452: Improve disk access patterns during compaction and
>>> range reads (PR available)
>>> https://github.com/apache/cassandra/pull/3606
>>>
>>> Thanks,
>>>
>>> – Scott
>>>
>>> On Feb 12, 2025, at 9:45 PM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>
>>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>>> possible.
>>>
>>> On Thu, Feb 13, 2025 at 10:48, Jordan West <jw...@apache.org> wrote:
>>>
>>>> Regarding the buffer size: it is configurable. My personal take is that
>>>> we’ve already tested this on a variety of hardware (from laptops to
>>>> large instance sizes) as well as a few different disk configs (it’s
>>>> also been run internally, in test, at a few places), and that it has
>>>> been reviewed by four committers and another contributor. Always love
>>>> to see more numbers. If folks want to take it for a spin on Alibaba
>>>> Cloud, Azure, etc. and determine the best buffer size, that’s awesome.
>>>> We could document which is suggested for the community. I don’t think
>>>> it’s necessary to block on that, however.
>>>>
>>>> Also, I am of course +1 to including this in 5.0.
>>>>
>>>> Jordan
>>>>
>>>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell <cclive1...@gmail.com> wrote:
>>>>
>>>>> My understanding is that there will be some differences in block
>>>>> storage among the various cloud platforms. Most visibly, the default
>>>>> read-ahead size will not be the same: for example, AWS EBS seems to
>>>>> be 256K, and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>>>
>>>>> Just like CASSANDRA-19488: provide the test method, see who can
>>>>> assist with the testing, and share the results.
>>>>>
>>>>> On Thu, Feb 13, 2025 at 08:30, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>>> Can you elaborate why? This would be several hundred hours of work
>>>>>> and would cost me thousands of $$ to perform.
>>>>>>
>>>>>> Filesystems and block devices are well understood. Could you give me
>>>>>> an example of what you think might be different here? This is
>>>>>> already one of the most well-tested and documented performance
>>>>>> patches ever contributed to the project.
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>>> I think it should be tested on most cloud platforms (at least AWS,
>>>>>>> Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>>>>>>
>>>>>>> On Thu, Feb 13, 2025 at 6:10 AM, Paulo Motta <pa...@apache.org> wrote:
>>>>>>>
>>>>>>>> I'm looking forward to these improvements; compaction needs TLC. :-)
>>>>>>>> A couple of questions:
>>>>>>>>
>>>>>>>> Has this been tested only on EBS, or also on EC2/bare-metal/Azure/etc.?
>>>>>>>> My only concern is if this is an optimization for EBS that can be a
>>>>>>>> deoptimization for other environments.
>>>>>>>>
>>>>>>>> Are there reproducible scripts that anyone can run to verify the
>>>>>>>> improvements in their own environments? This could help alleviate
>>>>>>>> any concerns and build confidence to introduce a perf improvement
>>>>>>>> in a patch release.
>>>>>>>>
>>>>>>>> I have not read the ticket in detail, so apologies if this was
>>>>>>>> already discussed there or elsewhere.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>> >
>>>>>>>> > Hey folks,
>>>>>>>> >
>>>>>>>> > Over the last 9 months Jordan and I have worked on
>>>>>>>> CASSANDRA-15452 [1]. The TL;DR is that we're internalizing a
>>>>>>>> read-ahead buffer, allowing us to issue fewer requests to disk
>>>>>>>> during compaction and range reads. This results in far fewer
>>>>>>>> system calls (roughly a 16x reduction) and, on systems with higher
>>>>>>>> read latency, a significant improvement in compaction throughput.
>>>>>>>> We've tested several different EBS configurations and found it
>>>>>>>> delivers up to a 10x improvement when read-ahead is optimized to
>>>>>>>> minimize read latency. I worked with AWS and the EBS team directly
>>>>>>>> on this and on the Best Practices for C* on EBS [2] I wrote for
>>>>>>>> them. I've performance-tested this patch extensively with hundreds
>>>>>>>> of billions of operations across several clusters and thousands of
>>>>>>>> compactions. It has less of an impact on local NVMe, since the p99
>>>>>>>> latency is already 10-30x lower than what you see on EBS
>>>>>>>> (100 micros vs 1-3 ms), and you can do hundreds of thousands of
>>>>>>>> IOPS vs a max of 16K.
>>>>>>>> >
>>>>>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>>>>>>>> significantly improves compaction by avoiding reading the
>>>>>>>> partition index. CASSANDRA-20092 has been merged to trunk
>>>>>>>> already [4].
>>>>>>>> >
>>>>>>>> > I think we should merge both of these patches into 5.0, as the
>>>>>>>> perf improvement should allow teams to increase the density of
>>>>>>>> EBS-backed C* clusters by 2-5x, driving cost way down. There are a
>>>>>>>> lot of teams running C* on EBS now. I'm currently working with one
>>>>>>>> that's bottlenecked on maxed-out EBS GP3 storage. I propose we
>>>>>>>> merge both because, without CASSANDRA-20092, we won't get the
>>>>>>>> performance improvements of CASSANDRA-15452 with BTI, only with
>>>>>>>> the BIG format. I've tested BTI in other situations and found it
>>>>>>>> to be far more performant than BIG.
>>>>>>>> >
>>>>>>>> > If we were looking at a small win, I wouldn't care much, but
>>>>>>>> since these patches, combined with UCS, allow more teams to run
>>>>>>>> C* on EBS at > 10TB / node, I think it's worth doing now.
>>>>>>>> >
>>>>>>>> > Thanks in advance,
>>>>>>>> > Jon
>>>>>>>> >
>>>>>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>>>>>> > [2] https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>>>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>>>>>> > [4] https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>
> --
> Dmitry Konstantinov
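
[Editor's note: the mechanism discussed above — serving many small sequential reads from one large buffered disk read, so that compaction issues far fewer system calls — can be sketched roughly as below. This is an illustrative standalone sketch, not Cassandra's actual CASSANDRA-15452 implementation; the class name and buffer sizes are hypothetical.]

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Minimal read-ahead sketch: small sequential reads are served from one
 *  large buffered disk read, cutting per-read system calls. */
class ReadAheadReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLen = 0;    // valid bytes currently in the buffer
    private long position = 0;    // next read offset in the file

    ReadAheadReader(Path path, int bufferSize) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "r");
        this.buffer = new byte[bufferSize];
    }

    /** Read up to len bytes into dst, refilling the buffer at most once. */
    int read(byte[] dst, int off, int len) throws IOException {
        if (position < bufferStart || position >= bufferStart + bufferLen) {
            // Buffer miss: one large read replaces many small ones.
            file.seek(position);
            bufferStart = position;
            bufferLen = Math.max(0, file.read(buffer, 0, buffer.length));
            if (bufferLen == 0) return -1; // EOF
        }
        int avail = (int) (bufferStart + bufferLen - position);
        int n = Math.min(len, avail);
        System.arraycopy(buffer, (int) (position - bufferStart), dst, off, n);
        position += n;
        return n;
    }

    @Override public void close() throws IOException { file.close(); }
}

public class Main {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ra", ".dat");
        byte[] data = new byte[1 << 16]; // 64 KiB of test data
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(tmp, data);

        // 1 MiB read-ahead: all sixteen 4 KiB reads hit one buffered read.
        try (ReadAheadReader r = new ReadAheadReader(tmp, 1 << 20)) {
            byte[] chunk = new byte[4096];
            int total = 0, n;
            while ((n = r.read(chunk, 0, chunk.length)) > 0) total += n;
            System.out.println(total); // prints 65536
        }
        Files.delete(tmp);
    }
}
```

The thread's tuning question maps onto the buffer size here: a buffer larger than the device's effective read-ahead (e.g. 256K on EBS per the discussion) amortizes more syscalls per refill, at the cost of reading bytes that may go unused.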