Regarding the buffer size, it is configurable. My personal take is that
we’ve already tested this on a variety of hardware (from laptops to large
instance sizes) and on a few different disk configs (it’s also been run
internally, in test, at a few places), and that it has been reviewed by four
committers and another contributor. I always love to see more numbers. If
folks want to take it for a spin on Alibaba Cloud, Azure, etc. and determine
the best buffer size, that’s awesome. We could document which size is
suggested for the community. I don’t think it’s necessary to block on that,
however.
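
For anyone who wants a feel for why the buffer size matters before running
a real benchmark, here is a minimal sketch. It is not code from the patch.
It only counts how many read calls a sequential scan of a file needs at a
given request size. The function names are hypothetical, and the 16 KiB and
256 KiB sizes are illustrative assumptions, chosen because their ratio
matches the roughly 16x reduction in system calls mentioned in this thread.

```python
import os
import tempfile


def make_test_file(size_bytes):
    """Create a temporary file of size_bytes zero bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size_bytes)
    return path


def count_reads(path, request_size):
    """Scan the file sequentially with unbuffered os.read calls.

    Returns (calls, total_bytes), where calls is the number of reads that
    returned data. The final empty read that signals EOF is not counted.
    """
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    total = 0
    try:
        while True:
            chunk = os.read(fd, request_size)
            if not chunk:
                break
            calls += 1
            total += len(chunk)
    finally:
        os.close(fd)
    return calls, total


if __name__ == "__main__":
    # Illustrative sizes only (assumptions, not values taken from the patch).
    path = make_test_file(4 * 1024 * 1024)
    try:
        for size in (16 * 1024, 256 * 1024):
            calls, total = count_reads(path, size)
            print(f"{size // 1024} KiB requests: {calls} reads, {total} bytes")
    finally:
        os.remove(path)
```

This only counts calls. It says nothing about latency, because a freshly
written temporary file is served from the page cache. Measuring throughput
on a real volume would need a cold cache and a real compaction workload.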

Also I am of course +1 to including this in 5.0.

Jordan

On Wed, Feb 12, 2025 at 19:50 guo Maxwell <cclive1...@gmail.com> wrote:

> My understanding is that block storage differs somewhat across cloud
> platforms. Most visibly, the default read-ahead size will differ. For
> example, AWS EBS seems to be 256K, and Alibaba Cloud seems to be 512K
> (if I remember correctly).
>
> As with CASSANDRA-19488, we could publish the test method, see who can
> help run the tests, and have them share the results.
>
> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Feb 13, 2025 at 08:30:
>
>> Can you elaborate why?  This would be several hundred hours of work and
>> would cost me thousands of $$ to perform.
>>
>> Filesystems and block devices are well understood.  Could you give me an
>> example of what you think might be different here?  This is already one of
>> the most well tested and documented performance patches ever contributed to
>> the project.
>>
>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>
>>> I think it should be tested on most cloud platforms (at least
>>> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>>
>>> Paulo Motta <pa...@apache.org> wrote on Thu, Feb 13, 2025 at 6:10 AM:
>>>
>>>> I'm looking forward to these improvements; compaction needs TLC. :-)
>>>> A couple of questions:
>>>>
>>>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
>>>> only concern is if this is an optimization for EBS that can be a
>>>> deoptimization for other environments.
>>>>
>>>> Are there reproducible scripts that anyone can run to verify the
>>>> improvements in their own environments? This could help alleviate any
>>>> concerns and gain confidence to introduce a perf. improvement in a
>>>> patch release.
>>>>
>>>> I have not read the ticket in detail, so apologies if this was already
>>>> discussed there or elsewhere.
>>>>
>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com>
>>>> wrote:
>>>> >
>>>> > Hey folks,
>>>> >
>>>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452
>>>> [1].  The TL;DR is that we're internalizing a read ahead buffer to allow us
>>>> to do fewer requests to disk during compaction and range reads.  This
>>>> results in far fewer system calls (roughly 16x reduction) and on systems
>>>> with higher read latency, a significant improvement in compaction
>>>> throughput.  We've tested several different EBS configurations and found it
>>>> delivers up to a 10x improvement when read ahead is optimized to minimize
>>>> read latency.  I worked with AWS and the EBS team directly on this and the
>>>> Best Practices for C* on EBS [2] I wrote for them.  I've performance tested
>>>> this patch extensively with hundreds of billions of operations across
>>>> several clusters and thousands of compactions.  It has less of an impact on
>>>> local NVMe, since the p99 latency is already 10-30x less than what you see
>>>> on EBS (100micros vs 1-3ms), and you can do hundreds of thousands of IOPS
>>>> vs a max of 16K.
>>>> >
>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>>>> significantly improves compaction by avoiding reading the partition index.
>>>> CASSANDRA-20092 has been merged to trunk already [4].
>>>> >
>>>> > I think we should merge both of these patches into 5.0, as the perf
>>>> improvement should allow teams to increase density of EBS backed C*
>>>> clusters by 2-5x, driving cost way down.  There's a lot of teams running C*
>>>> on EBS now.  I'm currently working with one that's bottlenecked on maxed
>>>> out EBS GP3 storage.  I propose we merge both, because without
>>>> CASSANDRA-20092, we won't get the performance improvements in
>>>> CASSANDRA-15452 with BTI, only BIG format.  I've tested BTI in other
>>>> situations and found it to be far more performant than BIG.
>>>> >
>>>> > If we were looking at a small win, I wouldn't care much, but since
>>>> these patches, combined with UCS, allow more teams to run C* on EBS at >
>>>> 10TB / node, I think it's worth doing now.
>>>> >
>>>> > Thanks in advance,
>>>> > Jon
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>> > [2]
>>>> https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>> > [4]
>>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>> >
>>>>
>>>
