I’ve been following this for a while, and I think it’s just solid engineering based on real-world challenges — probably one of the best types of contributions to have. I’m +1 on adding it to 5.0.
Patrick

On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
> +1 (nb) from my side. I raised a few comments for CASSANDRA-15452 some
> time ago and Jordan addressed them.
> I have also backported the CASSANDRA-15452 changes to my internal 4.1 fork
> and got about a 15% reduction in compaction time, even for a node with a
> local SSD.
>
> On Thu, 13 Feb 2025 at 13:22, Jordan West <jw...@apache.org> wrote:
>> For 15452 that’s correct (and I believe also for 20092). For 15452, the
>> trunk and 5.0 patches are basically identical.
>>
>> Jordan
>>
>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas <sc...@paradoxica.net> wrote:
>>> Checking to confirm the specific patches proposed for backport – is it
>>> the trunk commit for C-20092 and the open GitHub PR against the 5.0
>>> branch for C-15452 linked below?
>>>
>>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction
>>> (committed to trunk)
>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>
>>> CASSANDRA-15452: Improve disk access patterns during compaction and
>>> range reads (PR available)
>>> https://github.com/apache/cassandra/pull/3606
>>>
>>> Thanks,
>>>
>>> – Scott
>>>
>>> On Feb 12, 2025, at 9:45 PM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>
>>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>>> possible.
>>>
>>> On Thu, Feb 13, 2025 at 10:48, Jordan West <jw...@apache.org> wrote:
>>>
>>>> Regarding the buffer size: it is configurable. My personal take is that
>>>> we’ve already tested this on a variety of hardware (from laptops to
>>>> large instance sizes) as well as a few different disk configs (it’s
>>>> also been run internally, in test, at a few places), and that it has
>>>> been reviewed by four committers and another contributor. Always love
>>>> to see more numbers. If folks want to take it for a spin on Alibaba
>>>> Cloud, Azure, etc. and determine the best buffer size, that’s awesome.
>>>> We could document which is suggested for the community. I don’t think
>>>> it’s necessary to block on that, however.
>>>>
>>>> Also, I am of course +1 to including this in 5.0.
>>>>
>>>> Jordan
>>>>
>>>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell <cclive1...@gmail.com> wrote:
>>>>
>>>>> My understanding is that there will be some differences in block
>>>>> storage among the various cloud platforms. Most visibly, the default
>>>>> read-ahead size will not be the same: for example, AWS EBS seems to
>>>>> be 256K, and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>>>
>>>>> Just like CASSANDRA-19488: provide the test method, see who can
>>>>> assist with the testing, and share the results.
>>>>>
>>>>> On Thu, Feb 13, 2025 at 08:30, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>>> Can you elaborate why? This would be several hundred hours of work
>>>>>> and would cost me thousands of $$ to perform.
>>>>>>
>>>>>> Filesystems and block devices are well understood. Could you give me
>>>>>> an example of what you think might be different here? This is
>>>>>> already one of the most well-tested and documented performance
>>>>>> patches ever contributed to the project.
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>>> I think it should be tested on most cloud platforms (at least AWS,
>>>>>>> Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>>>>>>
>>>>>>> On Thu, Feb 13, 2025 at 6:10 AM, Paulo Motta <pa...@apache.org> wrote:
>>>>>>>
>>>>>>>> I'm looking forward to these improvements; compaction needs TLC. :-)
>>>>>>>> A couple of questions:
>>>>>>>>
>>>>>>>> Has this been tested only on EBS, or also on EC2/bare-metal/Azure/etc.?
>>>>>>>> My only concern is if this is an optimization for EBS that can be a
>>>>>>>> deoptimization for other environments.
>>>>>>>>
>>>>>>>> Are there reproducible scripts that anyone can run to verify the
>>>>>>>> improvements in their own environments? This could help alleviate
>>>>>>>> any concerns and build confidence to introduce a perf improvement
>>>>>>>> in a patch release.
>>>>>>>>
>>>>>>>> I have not read the ticket in detail, so apologies if this was
>>>>>>>> already discussed there or elsewhere.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>> >
>>>>>>>> > Hey folks,
>>>>>>>> >
>>>>>>>> > Over the last 9 months Jordan and I have worked on
>>>>>>>> CASSANDRA-15452 [1]. The TL;DR is that we're internalizing a
>>>>>>>> read-ahead buffer, allowing us to issue fewer requests to disk
>>>>>>>> during compaction and range reads. This results in far fewer
>>>>>>>> system calls (roughly a 16x reduction) and, on systems with higher
>>>>>>>> read latency, a significant improvement in compaction throughput.
>>>>>>>> We've tested several different EBS configurations and found it
>>>>>>>> delivers up to a 10x improvement when read-ahead is optimized to
>>>>>>>> minimize read latency. I worked with AWS and the EBS team directly
>>>>>>>> on this and on the Best Practices for C* on EBS [2] I wrote for
>>>>>>>> them. I've performance-tested this patch extensively with hundreds
>>>>>>>> of billions of operations across several clusters and thousands of
>>>>>>>> compactions. It has less of an impact on local NVMe, since the p99
>>>>>>>> latency is already 10-30x lower than what you see on EBS
>>>>>>>> (100 micros vs 1-3 ms), and you can do hundreds of thousands of
>>>>>>>> IOPS vs a max of 16K.
>>>>>>>> >
>>>>>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>>>>>>>> significantly improves compaction by avoiding reading the
>>>>>>>> partition index. CASSANDRA-20092 has been merged to trunk
>>>>>>>> already [4].
>>>>>>>> >
>>>>>>>> > I think we should merge both of these patches into 5.0, as the
>>>>>>>> perf improvement should allow teams to increase the density of
>>>>>>>> EBS-backed C* clusters by 2-5x, driving cost way down. There are a
>>>>>>>> lot of teams running C* on EBS now. I'm currently working with one
>>>>>>>> that's bottlenecked on maxed-out EBS GP3 storage. I propose we
>>>>>>>> merge both because, without CASSANDRA-20092, we won't get the
>>>>>>>> performance improvements of CASSANDRA-15452 with BTI, only with
>>>>>>>> the BIG format. I've tested BTI in other situations and found it
>>>>>>>> to be far more performant than BIG.
>>>>>>>> >
>>>>>>>> > If we were looking at a small win, I wouldn't care much, but
>>>>>>>> since these patches, combined with UCS, allow more teams to run
>>>>>>>> C* on EBS at > 10TB / node, I think it's worth doing now.
>>>>>>>> >
>>>>>>>> > Thanks in advance,
>>>>>>>> > Jon
>>>>>>>> >
>>>>>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>>>>>> > [2] https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>>>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>>>>>> > [4] https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>
> --
> Dmitry Konstantinov
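
[Editor's note: the mechanism discussed above — serving many small sequential reads from one large buffered disk read, so that compaction issues far fewer system calls — can be sketched roughly as below. This is an illustrative standalone sketch, not Cassandra's actual CASSANDRA-15452 implementation; the class name and buffer sizes are hypothetical.]

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Minimal read-ahead sketch: small sequential reads are served from one
 *  large buffered disk read, cutting per-read system calls. */
class ReadAheadReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLen = 0;    // valid bytes currently in the buffer
    private long position = 0;    // next read offset in the file

    ReadAheadReader(Path path, int bufferSize) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "r");
        this.buffer = new byte[bufferSize];
    }

    /** Read up to len bytes into dst, refilling the buffer at most once. */
    int read(byte[] dst, int off, int len) throws IOException {
        if (position < bufferStart || position >= bufferStart + bufferLen) {
            // Buffer miss: one large read replaces many small ones.
            file.seek(position);
            bufferStart = position;
            bufferLen = Math.max(0, file.read(buffer, 0, buffer.length));
            if (bufferLen == 0) return -1; // EOF
        }
        int avail = (int) (bufferStart + bufferLen - position);
        int n = Math.min(len, avail);
        System.arraycopy(buffer, (int) (position - bufferStart), dst, off, n);
        position += n;
        return n;
    }

    @Override public void close() throws IOException { file.close(); }
}

public class Main {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ra", ".dat");
        byte[] data = new byte[1 << 16]; // 64 KiB of test data
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(tmp, data);

        // 1 MiB read-ahead: all sixteen 4 KiB reads hit one buffered read.
        try (ReadAheadReader r = new ReadAheadReader(tmp, 1 << 20)) {
            byte[] chunk = new byte[4096];
            int total = 0, n;
            while ((n = r.read(chunk, 0, chunk.length)) > 0) total += n;
            System.out.println(total); // prints 65536
        }
        Files.delete(tmp);
    }
}
```

The thread's tuning question maps onto the buffer size here: a buffer larger than the device's effective read-ahead (e.g. 256K on EBS per the discussion) amortizes more syscalls per refill, at the cost of reading bytes that may go unused.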