Hi Stefania,
> Try reading the discussions on CASSANDRA-8894
> <https://issues.apache.org/jira/browse/CASSANDRA-8894>, plus related
> tickets, and CASSANDRA-8630
> <https://issues.apache.org/jira/browse/CASSANDRA-8630>, both delivered in
> 3.0.

Yes, I agree. I think the difficulty is getting the smallest possible reads
for any RAR operation while still using an optimal, larger read size for
big sequential I/O operations.

We saw the same behavior of a lot of small unnecessary reads when we logged
read sizes in ChannelProxy.

> This code changed significantly in 3.0, and then again more recently in the
> latest 3.x releases, and it is still in the process of changing. I am more
> familiar with the 3.0 changes than the recent ones, but if you are
> interested in deep diving in the code, and if you specify which version, I
> can be more precise and point you to the relevant code sections.

Unfortunately we are looking at the 2.2.6 code base right now, so it sounds
like we may be off quite a bit from the newer 3.x work.

> The quick summary is that, for standard access mode, the buffer size is set
> to 64k if there is a limiter, even if its throughput is unlimited, which
> means that for compaction and other seq. ops that use a limiter, the buffer
> size will be 64k, whilst for other read ops the buffer size, in 3.0+, is
> calculated in SegmentedFile.bufferSize()
> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java#L226>

Ok, thanks, that's a good starting point. Is this the access point for any
RAR and/or compaction I/O on a compressed SSTable?

> and it is based on the average record size, the disk type and two other
> configurable parameters: the chance that a record will cross an aligned
> cache page, and the percentile used to estimate the record size. You find
> these 3 parameters here
> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/config/Config.java#L261>
> and you may try optimizing them, given that the ticket for benchmarking
> them, CASSANDRA-10407 <https://issues.apache.org/jira/browse/CASSANDRA-10407>,
> is still open. The buffer size will always be rounded to multiples of the
> cache page size, 4k.
>
> It is currently not possible to use mmap for one type of reads and std for
> another type, as far as I know.

Alright, that's what I figured. It would be interesting to see what a mixed
setup would look like for our workload.
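Just to make sure I'm reading the bufferSize() heuristic you describe
correctly, here's a rough back-of-the-envelope sketch of my understanding.
To be clear, this is only my own approximation for discussion, not the
actual SegmentedFile/Config code, and the knob names and defaults below are
placeholders loosely modeled on the parameters you described:

    // Sketch of the buffer sizing heuristic as I understand it (NOT the
    // real Cassandra code; names and defaults are placeholders).
    public class BufferSizeSketch
    {
        static final int PAGE_SIZE = 4096;                 // cache page size, 4k
        static final int RATE_LIMITED_BUFFER = 64 * 1024;  // 64k when a limiter is present

        // Placeholder knobs standing in for the percentile used to estimate
        // the record size and the acceptable chance of crossing an aligned page.
        static final double RECORD_SIZE_PERCENTILE = 0.95;
        static final double PAGE_CROSS_CHANCE = 0.1;

        // recordSize: record size estimated at RECORD_SIZE_PERCENTILE from
        // sstable stats (assumed to be computed by the caller).
        static int bufferSize(long recordSize, boolean rateLimited)
        {
            if (rateLimited)
                return RATE_LIMITED_BUFFER; // compaction and other seq. ops with a limiter

            // Probability that a record of this size straddles an aligned 4k
            // page, assuming a uniformly distributed start offset within a page.
            double crossChance = (recordSize % PAGE_SIZE) / (double) PAGE_SIZE;

            // If crossing is likely enough, budget one extra page for the read.
            if (crossChance > PAGE_CROSS_CHANCE)
                recordSize += PAGE_SIZE;

            // Always round up to a multiple of the 4k cache page size.
            long pages = (recordSize + PAGE_SIZE - 1) / PAGE_SIZE;
            return (int) (pages * PAGE_SIZE);
        }

        public static void main(String[] args)
        {
            // Our rows are < 2k, so a random read would use an 8k buffer here
            // (a 2k record has a ~50% chance of straddling a page boundary).
            System.out.println(bufferSize(2048, false));        // 8192
            // Rate-limited sequential reads (compaction) would use 64k buffers.
            System.out.println(bufferSize(2048, true));         // 65536

            // Back-of-the-envelope IOPS at a fixed 100 MB/s of compaction reads:
            long throughput = 100L * 1024 * 1024;
            System.out.println(throughput / PAGE_SIZE);            // ~25,600 ops/s at 4k (mmap page faults)
            System.out.println(throughput / RATE_LIMITED_BUFFER);  // ~1,600 ops/s at 64k buffered reads
        }
    }

If that's roughly the idea, then moving compaction reads from 4k page
faults to 64k buffered reads would cut read ops by an order of magnitude at
the same throughput, which is exactly the kind of savings we're after.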
Thanks for the info!

Cheers,

Mike

> On Sun, Jun 19, 2016 at 6:17 PM, Mike Heffner <m...@librato.com> wrote:
>
> > Hi,
> >
> > I'm curious to know if anyone has attempted to improve read IOPS
> > performance during sequential I/O operations (largely compactions) while
> > still maintaining read performance for small-row, random-access client
> > reads?
> >
> > Our use case has a very high write-to-read load ratio with rows that are
> > small (< 1-2kb). We've taken a number of steps to ensure that
> > random-access client reads are optimal, reducing read_ahead and even
> > using a smaller-than-default LZ4 chunk size. So far performance has been
> > great, with p95 read times that are < 10ms.
> >
> > However, we have noticed that our total read IOPS to the Cassandra data
> > drive is extremely high compared to our write IOPS, almost 15x the write
> > IOPS to the same drive. We even set up a ring that took the same write
> > load with zero client reads and observed that the high read IOPS were
> > driven by compaction operations. During large (>50GB) compactions, write
> > and read volume (bytes) were nearly identical, which matched our
> > assumptions, while read IOPS were 15x write IOPS.
> >
> > When we plotted the average read and write op size, we saw an average
> > read op size of just under 5KB and an average write op size of 120KB.
> > Given we are using the default disk access mode of mmap, this aligns
> > with our assumption that we are paging in a single 4KB page at a time,
> > while the write sizes reflect coalesced flushes. We wanted to test this,
> > so we switched a single node to `disk_access_mode: standard`, which
> > should do reads based on the chunk sizes, and found that the read op
> > size increased to ~7.5KB:
> >
> > https://imgur.com/okbfFby
> >
> > We don't want to sacrifice our read performance, but we must also
> > scale/size our disk performance based on peak IOPS. If we could cut the
> > read IOPS by a quarter or even half during compaction operations, that
> > would mean a large cost savings. We are also limited on drive
> > throughput, so there's a theoretical maximum op size we'd want to use to
> > stay under that throughput limit. Alternatively, we could tune
> > compaction throughput to stay under that limit.
> >
> > Has any work been done to optimize sequential I/O operations in
> > Cassandra? Naively it seems that sequential I/O operations could use a
> > standard disk access mode reader with a configurable block size while
> > normal read operations stick to the mmap'd segments. Being unfamiliar
> > with the code, I'm not sure: are compaction/sequential sstable reads
> > done through a single interface, or do they use the same path as normal
> > read ops?
> >
> > Thoughts?
> >
> > -Mike
> >
> > --
> >
> > Mike Heffner <m...@librato.com>
> > Librato, Inc.
>
> --
>
> Stefania Alborghetti
> Apache Cassandra Software Engineer, DataStax <http://www.datastax.com/>
> +852 6114 9265 | stefania.alborghe...@datastax.com

--

Mike Heffner <m...@librato.com>
Librato, Inc.