+1 - Thanks for doing the work to figure this out and find a good fix.

Doug
> On Feb 13, 2025, at 11:28 AM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> I’ve been following this for a while and I think it’s just some solid
> engineering based on real-world challenges. Probably one of the best types of
> contributions to have. I’m +1 on adding it to 5
>
> Patrick
>
> On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>> +1 (nb) from my side. I raised a few comments on CASSANDRA-15452 some time
>> ago and Jordan addressed them.
>> I have also backported the CASSANDRA-15452 changes to my internal 4.1 fork
>> and got about a 15% reduction in compaction time, even for a node with a
>> local SSD.
>>
>> On Thu, 13 Feb 2025 at 13:22, Jordan West <jw...@apache.org> wrote:
>>> For 15452 that’s correct (and I believe also for 20092). For 15452, the
>>> trunk and 5.0 patches are basically identical.
>>>
>>> Jordan
>>>
>>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>> Checking to confirm the specific patches proposed for backport – is it
>>>> the trunk commit for C-20092 and the open GitHub PR against the 5.0
>>>> branch for C-15452, linked below?
>>>>
>>>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction
>>>> (committed to trunk)
>>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>>
>>>> CASSANDRA-15452: Improve disk access patterns during compaction and
>>>> range reads (PR available)
>>>> https://github.com/apache/cassandra/pull/3606
>>>>
>>>> Thanks,
>>>>
>>>> – Scott
>>>>
>>>>> On Feb 12, 2025, at 9:45 PM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>
>>>>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>>>>> possible.
>>>>>
>>>>> Jordan West <jw...@apache.org> wrote on Thu, Feb 13, 2025 at 10:48:
>>>>>> Regarding the buffer size, it is configurable.
>>>>>> My personal take is that we’ve already tested this on a variety of
>>>>>> hardware (from laptops to large instance sizes) as well as a few
>>>>>> different disk configs (it’s also been run internally, in test, at a
>>>>>> few places), and it has been reviewed by four committers and another
>>>>>> contributor. Always love to see more numbers. If folks want to take it
>>>>>> for a spin on Alibaba Cloud, Azure, etc. and determine the best buffer
>>>>>> size, that’s awesome. We could document what is suggested for the
>>>>>> community. I don’t think it’s necessary to block on that, however.
>>>>>>
>>>>>> Also, I am of course +1 to including this in 5.0.
>>>>>>
>>>>>> Jordan
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>> What I understand is that there will be some differences in block
>>>>>>> storage among the various cloud platforms. Most visibly, the default
>>>>>>> read-ahead size will differ. For example, AWS EBS seems to be 256K,
>>>>>>> and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>>>>>
>>>>>>> Just like 19488, give the test method, see who can assist with the
>>>>>>> test, and provide the results.
>>>>>>>
>>>>>>> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Feb 13, 2025 at 08:30:
>>>>>>>> Can you elaborate why? This would be several hundred hours of work
>>>>>>>> and would cost me thousands of dollars to perform.
>>>>>>>>
>>>>>>>> Filesystems and block devices are well understood. Could you give me
>>>>>>>> an example of what you think might be different here? This is
>>>>>>>> already one of the most well-tested and documented performance
>>>>>>>> patches ever contributed to the project.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>>>> I think it should be tested on most cloud platforms (at least AWS,
>>>>>>>>> Azure, and GCP) before being merged into 5.0, just like
>>>>>>>>> CASSANDRA-19488.
>>>>>>>>>
>>>>>>>>> Paulo Motta <pa...@apache.org> wrote on Thu, Feb 13, 2025 at 6:10 AM:
>>>>>>>>>> I'm looking forward to these improvements; compaction needs TLC. :-)
>>>>>>>>>> A couple of questions:
>>>>>>>>>>
>>>>>>>>>> Has this been tested only on EBS, or also on EC2/bare metal/Azure/
>>>>>>>>>> etc.? My only concern is that an optimization for EBS could be a
>>>>>>>>>> deoptimization for other environments.
>>>>>>>>>>
>>>>>>>>>> Are there reproducible scripts that anyone can run to verify the
>>>>>>>>>> improvements in their own environment? This could help alleviate
>>>>>>>>>> any concerns and build confidence in introducing a perf
>>>>>>>>>> improvement in a patch release.
>>>>>>>>>>
>>>>>>>>>> I have not read the ticket in detail, so apologies if this was
>>>>>>>>>> already discussed there or elsewhere.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hey folks,
>>>>>>>>>> >
>>>>>>>>>> > Over the last 9 months, Jordan and I have worked on
>>>>>>>>>> > CASSANDRA-15452 [1]. The TL;DR is that we're internalizing a
>>>>>>>>>> > read-ahead buffer to let us make fewer requests to disk during
>>>>>>>>>> > compaction and range reads. This results in far fewer system
>>>>>>>>>> > calls (roughly a 16x reduction) and, on systems with higher read
>>>>>>>>>> > latency, a significant improvement in compaction throughput.
>>>>>>>>>> > We've tested several different EBS configurations and found it
>>>>>>>>>> > delivers up to a 10x improvement when read ahead is optimized to
>>>>>>>>>> > minimize read latency.
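The read-ahead mechanism Jon describes can be sketched as follows. This is a hypothetical illustration, not the actual CASSANDRA-15452 code: the class name, sizes, and API are invented. The point is only the ratio: many small logical reads are served from one large buffered disk request, so a 256 KiB buffer serving 16 KiB reads issues roughly 16x fewer system calls.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical sketch: serve small sequential reads from one large read-ahead buffer. */
class ReadAheadReader implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer;       // the large read-ahead buffer
    private long underlyingReads = 0;      // real disk requests issued

    ReadAheadReader(String path, int bufferSize) throws IOException {
        this.channel = new RandomAccessFile(path, "r").getChannel();
        this.buffer = ByteBuffer.allocate(bufferSize);
        buffer.limit(0);                   // start with an empty buffer
    }

    /** Copy up to dst.remaining() bytes into dst; refill only when the buffer is drained. */
    int read(ByteBuffer dst) throws IOException {
        if (!buffer.hasRemaining()) {
            buffer.clear();
            int n = channel.read(buffer);  // one large sequential request
            if (n < 0) return -1;          // end of file
            underlyingReads++;
            buffer.flip();
        }
        int copied = Math.min(dst.remaining(), buffer.remaining());
        ByteBuffer slice = buffer.duplicate();
        slice.limit(slice.position() + copied);
        dst.put(slice);
        buffer.position(buffer.position() + copied);
        return copied;
    }

    long underlyingReads() { return underlyingReads; }

    @Override
    public void close() throws IOException { channel.close(); }
}
```

With a 256 KiB buffer, 64 logical 16 KiB reads over a 1 MiB file map to only 4 underlying requests: the 16x reduction from the message above.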
>>>>>>>>>> > I worked with AWS and the EBS team directly on this and on the
>>>>>>>>>> > Best Practices for C* on EBS [2] I wrote for them. I've
>>>>>>>>>> > performance tested this patch extensively, with hundreds of
>>>>>>>>>> > billions of operations across several clusters and thousands of
>>>>>>>>>> > compactions. It has less of an impact on local NVMe, since the
>>>>>>>>>> > p99 latency is already 10-30x lower than what you see on EBS
>>>>>>>>>> > (100 micros vs 1-3 ms), and you can do hundreds of thousands of
>>>>>>>>>> > IOPS vs a max of 16K.
>>>>>>>>>> >
>>>>>>>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>>>>>>>>>> > significantly improves compaction by avoiding reading the
>>>>>>>>>> > partition index. CASSANDRA-20092 has already been merged to
>>>>>>>>>> > trunk [4].
>>>>>>>>>> >
>>>>>>>>>> > I think we should merge both of these patches into 5.0, as the
>>>>>>>>>> > perf improvement should allow teams to increase the density of
>>>>>>>>>> > EBS-backed C* clusters by 2-5x, driving cost way down. There are
>>>>>>>>>> > a lot of teams running C* on EBS now; I'm currently working with
>>>>>>>>>> > one that's bottlenecked on maxed-out EBS GP3 storage. I propose
>>>>>>>>>> > we merge both because, without CASSANDRA-20092, we won't get the
>>>>>>>>>> > performance improvements of CASSANDRA-15452 with the BTI format,
>>>>>>>>>> > only with BIG. I've tested BTI in other situations and found it
>>>>>>>>>> > to be far more performant than BIG.
>>>>>>>>>> >
>>>>>>>>>> > If we were looking at a small win, I wouldn't care much, but
>>>>>>>>>> > since these patches, combined with UCS, allow more teams to run
>>>>>>>>>> > C* on EBS at > 10TB / node, I think it's worth doing now.
>>>>>>>>>> >
>>>>>>>>>> > Thanks in advance,
>>>>>>>>>> > Jon
>>>>>>>>>> >
>>>>>>>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>>>>>>>> > [2] https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>>>>>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>>>>>>>> > [4] https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f

>>
>> --
>> Dmitry Konstantinov
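A back-of-the-envelope model shows why the same syscall reduction matters far more on EBS than on local NVMe, using the latencies Jon quotes (~1 ms for EBS, ~100 µs for NVMe). This is a toy, serial, latency-bound model I wrote for illustration, not a benchmark from the ticket; it ignores parallelism and throughput caps.

```java
/** Toy model: serial scan time ≈ number of requests x per-request latency. */
class ReadAheadMath {
    /** How many requests it takes to scan fileBytes at a given request size. */
    static long requests(long fileBytes, long requestBytes) {
        return (fileBytes + requestBytes - 1) / requestBytes; // ceiling division
    }

    /** Latency-bound scan time in seconds, assuming one request at a time. */
    static double scanSeconds(long fileBytes, long requestBytes, double latencySeconds) {
        return requests(fileBytes, requestBytes) * latencySeconds;
    }

    public static void main(String[] args) {
        long gib = 1L << 30; // scan 1 GiB of SSTable data
        // 16 KiB unbuffered requests vs 256 KiB read-ahead requests
        System.out.printf("EBS (1 ms),  16 KiB requests: %.1f s%n",
                          scanSeconds(gib, 16 << 10, 1e-3));
        System.out.printf("EBS (1 ms), 256 KiB requests: %.1f s%n",
                          scanSeconds(gib, 256 << 10, 1e-3));
        System.out.printf("NVMe (100 us), 16 KiB requests: %.1f s%n",
                          scanSeconds(gib, 16 << 10, 1e-4));
    }
}
```

Under this model, the 16x fewer requests translate almost directly into a 16x shorter latency-bound scan on EBS, while on NVMe the unbuffered case was already cheap, which matches Jon's observation that the patch helps high-latency storage the most.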