Thanks Jon for the additional feedback. I will take a look at the ticket more closely and try to reproduce the claimed improvements on my laptop.
If there's no regression in performance, I'm +1 on including this improvement in 5.0.

On Wed, Feb 12, 2025 at 7:28 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hey Paulo,
>
> Great questions. I've tested the patch fairly extensively across a wide
> variety of AWS hardware types, both EBS and not. I believe Dave Capwell
> tested it using infra he had available.
>
> In every case I've looked at, it's been a win, or on NVMe barely a change.
> The reason for this is that we're fetching data from a local byte array
> instead of the page cache. There's no circumstance where it can be faster to
> get data out of the page cache than sequentially fetching bytes out of a
> byte array.
>
> In the ticket I've provided extensive documentation showing how to repro
> using easy-cass-lab and easy-cass-stress. I've shown how to watch the
> filesystem and block device for individual reads (xfsslower & biosnoop); you
> can see each filesystem access, how many bytes were fetched, and how long it
> took. I've included what I think is a fairly comprehensive analysis of the
> effects of the patch. I accounted for differences in instance types by
> switching the C* version from stock 5.0 to 15452 patched.
>
> I've tried to use this as an opportunity to demonstrate what I think is the
> level of detail that a patch like this should have, so I hope you get a
> chance to take the time to check out the JIRA. There's 100x more detail
> there than I've provided in this email.
>
> Jon
>
>
> On Wed, Feb 12, 2025 at 2:10 PM Paulo Motta <pa...@apache.org> wrote:
>>
>> I'm looking forward to these improvements, compaction needs TLC. :-)
>> A couple of questions:
>>
>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
>> only concern is if this is an optimization for EBS that can be a
>> deoptimization for other environments.
>>
>> Are there reproducible scripts that anyone can run to verify the
>> improvements in their own environments? This could help alleviate any
>> concerns and gain confidence to introduce a perf. improvement in a
>> patch release.
>>
>> I have not read the ticket in detail, so apologies if this was already
>> discussed there or elsewhere.
>>
>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> >
>> > Hey folks,
>> >
>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452 [1].
>> > The TL;DR is that we're internalizing a read ahead buffer to allow us to
>> > do fewer requests to disk during compaction and range reads. This results
>> > in far fewer system calls (roughly 16x reduction) and, on systems with
>> > higher read latency, a significant improvement in compaction throughput.
>> > We've tested several different EBS configurations and found it delivers up
>> > to a 10x improvement when read ahead is optimized to minimize read
>> > latency. I worked with AWS and the EBS team directly on this and the Best
>> > Practices for C* on EBS [2] I wrote for them. I've performance tested
>> > this patch extensively with hundreds of billions of operations across
>> > several clusters and thousands of compactions. It has less of an impact
>> > on local NVMe, since the p99 latency is already 10-30x less than what you
>> > see on EBS (100 µs vs 1-3 ms), and you can do hundreds of thousands of
>> > IOPS vs a max of 16K.
>> >
>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which significantly
>> > improves compaction by avoiding reading the partition index.
>> > CASSANDRA-20092 has been merged to trunk already [4].
>> >
>> > I think we should merge both of these patches into 5.0, as the perf
>> > improvement should allow teams to increase density of EBS-backed C*
>> > clusters by 2-5x, driving cost way down. There's a lot of teams running
>> > C* on EBS now. I'm currently working with one that's bottlenecked on
>> > maxed-out EBS GP3 storage. I propose we merge both, because without
>> > CASSANDRA-20092, we won't get the performance improvements in
>> > CASSANDRA-15452 with BTI, only BIG format. I've tested BTI in other
>> > situations and found it to be far more performant than BIG.
>> >
>> > If we were looking at a small win, I wouldn't care much, but since these
>> > patches, combined with UCS, allow more teams to run C* on EBS at > 10TB /
>> > node, I think it's worth doing now.
>> >
>> > Thanks in advance,
>> > Jon
>> >
>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>> > [2]
>> > https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>> > [4]
>> > https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>> >
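
For readers who want a feel for why an internal read ahead buffer cuts the number of requests to disk, here is a minimal Python sketch. It is not code from CASSANDRA-15452: the names CountingFile, ReadAheadReader and count_reads are hypothetical, and the 16 KiB read size and 256 KiB buffer size used in the usage note are illustrative assumptions chosen only to show a ratio of roughly 16x. Each read() on the underlying source stands in for one system call.

```python
import io


class CountingFile:
    """Wraps a binary file-like object and counts read() calls.

    Each call stands in for one request to the filesystem (hypothetical helper).
    """

    def __init__(self, raw):
        self.raw = raw
        self.reads = 0

    def read(self, n):
        self.reads += 1
        return self.raw.read(n)


class ReadAheadReader:
    """Serves small sequential reads out of a larger in-memory buffer."""

    def __init__(self, source, buffer_size):
        self.source = source
        self.buffer_size = buffer_size
        self.buf = b""
        self.pos = 0

    def read(self, n):
        out = bytearray()
        while len(out) < n:
            if self.pos >= len(self.buf):
                # Buffer exhausted: refill with one large read from the source.
                self.buf = self.source.read(self.buffer_size)
                self.pos = 0
                if not self.buf:
                    break  # end of data
            take = min(n - len(out), len(self.buf) - self.pos)
            out += self.buf[self.pos:self.pos + take]
            self.pos += take
        return bytes(out)


def count_reads(data, chunk_size, buffer_size=None):
    """Read all of `data` in chunk_size pieces, optionally through a buffer.

    Returns (number of reads issued to the underlying source, bytes read back).
    """
    src = CountingFile(io.BytesIO(data))
    reader = ReadAheadReader(src, buffer_size) if buffer_size else src
    total = bytearray()
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        total += chunk
    return src.reads, bytes(total)
```

Calling count_reads(data, 16 * 1024) and count_reads(data, 16 * 1024, 256 * 1024) on the same data returns identical bytes, but the buffered variant issues about one sixteenth as many reads to the underlying source. Whether that translates into a throughput win depends on the per-read latency of the storage, which is the point made above about EBS versus local NVMe.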