+1 - Thanks for doing the work to figure this out and find a good fix.

Doug
> On Feb 13, 2025, at 11:28 AM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> I’ve been following this for a while and I think it’s just some solid
> engineering based on real-world challenges. Probably one of the best types of
> contributions to have. I’m +1 on adding it to 5
>
> Patrick
>
> On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>> +1 (nb) from my side. I raised a few comments on CASSANDRA-15452 some time
>> ago and Jordan addressed them.
>> I have also backported the CASSANDRA-15452 changes to my internal 4.1 fork
>> and got about a 15% reduction in compaction time, even for a node with a
>> local SSD.
>>
>> On Thu, 13 Feb 2025 at 13:22, Jordan West <jw...@apache.org> wrote:
>>> For 15452 that’s correct (and I believe also for 20092). For 15452, the
>>> trunk and 5.0 patches are basically identical.
>>>
>>> Jordan
>>>
>>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>> Checking to confirm the specific patches proposed for backport – is it
>>>> the trunk commit for C-20092 and the open GitHub PR against the 5.0
>>>> branch for C-15452, linked below?
>>>>
>>>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction
>>>> (committed to trunk)
>>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>>
>>>> CASSANDRA-15452: Improve disk access patterns during compaction and
>>>> range reads (PR available)
>>>> https://github.com/apache/cassandra/pull/3606
>>>>
>>>> Thanks,
>>>>
>>>> – Scott
>>>>
>>>>> On Feb 12, 2025, at 9:45 PM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>
>>>>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>>>>> possible.
>>>>>
>>>>> Jordan West <jw...@apache.org> wrote on Thu, Feb 13, 2025 at 10:48:
>>>>>> Regarding the buffer size, it is configurable.
>>>>>> My personal take is that we’ve already tested this on a variety of
>>>>>> hardware (from laptops to large instance sizes) as well as a few
>>>>>> different disk configs (it’s also been run internally, in test, at a
>>>>>> few places), and it has been reviewed by four committers and another
>>>>>> contributor. Always love to see more numbers. If folks want to take it
>>>>>> for a spin on Alibaba Cloud, Azure, etc. and determine the best buffer
>>>>>> size, that’s awesome. We could document what is suggested for the
>>>>>> community. I don’t think it’s necessary to block on that, however.
>>>>>>
>>>>>> Also, I am of course +1 to including this in 5.0.
>>>>>>
>>>>>> Jordan
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>> What I understand is that there will be some differences in block
>>>>>>> storage among the various cloud platforms. Most visibly, the default
>>>>>>> read-ahead size will differ. For example, AWS EBS seems to be 256K,
>>>>>>> and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>>>>>
>>>>>>> Just like 19488, give the test method, see who can assist with the
>>>>>>> test, and provide the results.
>>>>>>>
>>>>>>> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Feb 13, 2025 at 08:30:
>>>>>>>> Can you elaborate why? This would be several hundred hours of work
>>>>>>>> and would cost me thousands of dollars to perform.
>>>>>>>>
>>>>>>>> Filesystems and block devices are well understood. Could you give me
>>>>>>>> an example of what you think might be different here? This is
>>>>>>>> already one of the most well-tested and documented performance
>>>>>>>> patches ever contributed to the project.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>>>> I think it should be tested on most cloud platforms (at least AWS,
>>>>>>>>> Azure, and GCP) before being merged into 5.0, just like
>>>>>>>>> CASSANDRA-19488.
>>>>>>>>>
>>>>>>>>> Paulo Motta <pa...@apache.org> wrote on Thu, Feb 13, 2025 at 6:10 AM:
>>>>>>>>>> I'm looking forward to these improvements; compaction needs TLC. :-)
>>>>>>>>>> A couple of questions:
>>>>>>>>>>
>>>>>>>>>> Has this been tested only on EBS, or also on EC2/bare metal/Azure/
>>>>>>>>>> etc.? My only concern is that an optimization for EBS could be a
>>>>>>>>>> deoptimization for other environments.
>>>>>>>>>>
>>>>>>>>>> Are there reproducible scripts that anyone can run to verify the
>>>>>>>>>> improvements in their own environment? This could help alleviate
>>>>>>>>>> any concerns and build confidence in introducing a perf
>>>>>>>>>> improvement in a patch release.
>>>>>>>>>>
>>>>>>>>>> I have not read the ticket in detail, so apologies if this was
>>>>>>>>>> already discussed there or elsewhere.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hey folks,
>>>>>>>>>> >
>>>>>>>>>> > Over the last 9 months, Jordan and I have worked on
>>>>>>>>>> > CASSANDRA-15452 [1]. The TL;DR is that we're internalizing a
>>>>>>>>>> > read-ahead buffer to let us make fewer requests to disk during
>>>>>>>>>> > compaction and range reads. This results in far fewer system
>>>>>>>>>> > calls (roughly a 16x reduction) and, on systems with higher read
>>>>>>>>>> > latency, a significant improvement in compaction throughput.
>>>>>>>>>> > We've tested several different EBS configurations and found it
>>>>>>>>>> > delivers up to a 10x improvement when read ahead is optimized to
>>>>>>>>>> > minimize read latency.
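The read-ahead mechanism Jon describes can be sketched as follows. This is a hypothetical illustration, not the actual CASSANDRA-15452 code: the class name, sizes, and API are invented. The point is only the ratio: many small logical reads are served from one large buffered disk request, so a 256 KiB buffer serving 16 KiB reads issues roughly 16x fewer system calls.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch: serve small sequential reads from one large read-ahead buffer. */
class ReadAheadReader implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer;       // the large read-ahead buffer
    private long underlyingReads = 0;      // real disk requests issued

    ReadAheadReader(String path, int bufferSize) throws IOException {
        this.channel = new RandomAccessFile(path, "r").getChannel();
        this.buffer = ByteBuffer.allocate(bufferSize);
        buffer.limit(0);                   // start with an empty buffer
    }

    /** Copy up to dst.remaining() bytes into dst; refill only when the buffer is drained. */
    int read(ByteBuffer dst) throws IOException {
        if (!buffer.hasRemaining()) {
            buffer.clear();
            int n = channel.read(buffer);  // one large sequential request
            if (n < 0) return -1;          // end of file
            underlyingReads++;
            buffer.flip();
        }
        int copied = Math.min(dst.remaining(), buffer.remaining());
        ByteBuffer slice = buffer.duplicate();
        slice.limit(slice.position() + copied);
        dst.put(slice);
        buffer.position(buffer.position() + copied);
        return copied;
    }

    long underlyingReads() { return underlyingReads; }

    @Override
    public void close() throws IOException { channel.close(); }
}
```

With a 256 KiB buffer, 64 logical 16 KiB reads over a 1 MiB file map to only 4 underlying requests: the 16x reduction from the message above.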
>>>>>>>>>> > I worked with AWS and the EBS team directly on this and on the
>>>>>>>>>> > Best Practices for C* on EBS [2] I wrote for them. I've
>>>>>>>>>> > performance tested this patch extensively, with hundreds of
>>>>>>>>>> > billions of operations across several clusters and thousands of
>>>>>>>>>> > compactions. It has less of an impact on local NVMe, since the
>>>>>>>>>> > p99 latency is already 10-30x lower than what you see on EBS
>>>>>>>>>> > (100 micros vs 1-3 ms), and you can do hundreds of thousands of
>>>>>>>>>> > IOPS vs a max of 16K.
>>>>>>>>>> >
>>>>>>>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>>>>>>>>>> > significantly improves compaction by avoiding reading the
>>>>>>>>>> > partition index. CASSANDRA-20092 has already been merged to
>>>>>>>>>> > trunk [4].
>>>>>>>>>> >
>>>>>>>>>> > I think we should merge both of these patches into 5.0, as the
>>>>>>>>>> > perf improvement should allow teams to increase the density of
>>>>>>>>>> > EBS-backed C* clusters by 2-5x, driving cost way down. There are
>>>>>>>>>> > a lot of teams running C* on EBS now; I'm currently working with
>>>>>>>>>> > one that's bottlenecked on maxed-out EBS GP3 storage. I propose
>>>>>>>>>> > we merge both because, without CASSANDRA-20092, we won't get the
>>>>>>>>>> > performance improvements of CASSANDRA-15452 with the BTI format,
>>>>>>>>>> > only with BIG. I've tested BTI in other situations and found it
>>>>>>>>>> > to be far more performant than BIG.
>>>>>>>>>> >
>>>>>>>>>> > If we were looking at a small win, I wouldn't care much, but
>>>>>>>>>> > since these patches, combined with UCS, allow more teams to run
>>>>>>>>>> > C* on EBS at > 10TB / node, I think it's worth doing now.
>>>>>>>>>> >
>>>>>>>>>> > Thanks in advance,
>>>>>>>>>> > Jon
>>>>>>>>>> >
>>>>>>>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>>>>>>>> > [2] https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>>>>>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>>>>>>>> > [4] https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f

>>
>> --
>> Dmitry Konstantinov
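A back-of-the-envelope model shows why the same syscall reduction matters far more on EBS than on local NVMe, using the latencies Jon quotes (~1 ms for EBS, ~100 µs for NVMe). This is a toy, serial, latency-bound model I wrote for illustration, not a benchmark from the ticket; it ignores parallelism and throughput caps.

```java
/** Toy model: serial scan time ≈ number of requests x per-request latency. */
class ReadAheadMath {
    /** How many requests it takes to scan fileBytes at a given request size. */
    static long requests(long fileBytes, long requestBytes) {
        return (fileBytes + requestBytes - 1) / requestBytes; // ceiling division
    }

    /** Latency-bound scan time in seconds, assuming one request at a time. */
    static double scanSeconds(long fileBytes, long requestBytes, double latencySeconds) {
        return requests(fileBytes, requestBytes) * latencySeconds;
    }

    public static void main(String[] args) {
        long gib = 1L << 30; // scan 1 GiB of SSTable data
        // 16 KiB unbuffered requests vs 256 KiB read-ahead requests
        System.out.printf("EBS (1 ms),  16 KiB requests: %.1f s%n",
                          scanSeconds(gib, 16 << 10, 1e-3));
        System.out.printf("EBS (1 ms), 256 KiB requests: %.1f s%n",
                          scanSeconds(gib, 256 << 10, 1e-3));
        System.out.printf("NVMe (100 us), 16 KiB requests: %.1f s%n",
                          scanSeconds(gib, 16 << 10, 1e-4));
    }
}
```

Under this model, the 16x fewer requests translate almost directly into a 16x shorter latency-bound scan on EBS, while on NVMe the unbuffered case was already cheap, which matches Jon's observation that the patch helps high-latency storage the most.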