Hi Mike

Try reading the discussions on CASSANDRA-8894
<https://issues.apache.org/jira/browse/CASSANDRA-8894>, plus related
tickets, and CASSANDRA-8630
<https://issues.apache.org/jira/browse/CASSANDRA-8630>, both delivered in
3.0.

This code changed significantly in 3.0, then again in the latest 3.x
releases, and it is still evolving. I am more familiar with the 3.0 changes
than with the recent ones, but if you are interested in deep diving into the
code, tell me which version and I can be more precise and point you to the
relevant code sections.

The quick summary: for standard access mode, the buffer size is set to 64k
whenever there is a rate limiter, even if its throughput is unlimited. So
compaction and other sequential operations that use a limiter read with a
64k buffer, while for other read ops the buffer size, in 3.0+, is
calculated in SegmentedFile.bufferSize()
<https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java#L226>
based on the average record size, the disk type, and two other
configurable parameters: the chance that a record will cross an aligned
cache page, and the percentile used to estimate the record size. You'll find
these three parameters here
<https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/config/Config.java#L261>,
and you may try optimizing them, given that the ticket for benchmarking them,
CASSANDRA-10407 <https://issues.apache.org/jira/browse/CASSANDRA-10407>, is
still open. The buffer size is always rounded up to a multiple of the 4k
cache page size.
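
To make the calculation concrete, here is a rough Java sketch of the kind of
logic bufferSize() applies for the ssd strategy. The constant names and the
0.1 default for the page cross chance are my approximations; the linked code
is authoritative:

    // Sketch of the SegmentedFile.bufferSize() logic (ssd strategy).
    // PAGE_CROSS_CHANCE stands in for disk_optimization_page_cross_chance;
    // the record size passed in would come from the configured percentile
    // (disk_optimization_estimate_percentile) of the observed record sizes.
    public class BufferSizeSketch
    {
        static final int PAGE_SIZE = 4096;           // aligned cache page
        static final double PAGE_CROSS_CHANCE = 0.1; // assumed default

        static int bufferSize(long recordSize)
        {
            // Probability that a record starting at a uniformly random
            // offset crosses a page boundary: the remainder of the record
            // size over the page size, divided by the page size.
            double pageCrossProbability = (recordSize % PAGE_SIZE) / (double) PAGE_SIZE;

            // If crossing is more likely than the configured chance,
            // read one extra page.
            if (pageCrossProbability > PAGE_CROSS_CHANCE)
                recordSize += PAGE_SIZE;

            return roundToPageSize(recordSize);
        }

        static int roundToPageSize(long size)
        {
            // Always round up to a multiple of the 4k cache page size.
            return (int) ((size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1));
        }

        public static void main(String[] args)
        {
            System.out.println(bufferSize(300));  // 4096: fits in one page
            System.out.println(bufferSize(5000)); // 12288: extra page added, then rounded up
        }
    }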

As far as I know, it is currently not possible to use mmap for one type of
read and standard access for another.

On Sun, Jun 19, 2016 at 6:17 PM, Mike Heffner <m...@librato.com> wrote:

> Hi,
>
> I'm curious whether anyone has attempted to improve read IOPS
> performance during sequential I/O operations (largely compactions) while
> still maintaining read performance for small-row, random-access client
> reads.
>
> Our use case has a very high write-to-read ratio, with rows that are
> small (< 1-2kb). We've taken many steps to ensure that client reads
> of random-access rows are optimal, by reducing read_ahead and even using a
> smaller default LZ4 chunk size. So far performance has been great, with p95
> read times < 10ms.
>
> However, we have noticed that our total read IOPS to the Cassandra data
> drive is extremely high compared to our write IOPS, almost 15x the write
> IOPS to the same drive. We even set up a ring that took the same write load
> with zero client reads and observed that the high read IOPS were driven by
> compaction operations. During large (>50GB) compactions, write and read
> volume (bytes) were nearly identical, which matched our assumptions, while
> read IOPS were 15x write IOPS.
>
> When we plotted the average read and write op size, we saw an average read
> op size of just under 5KB and an average write op size of 120KB. Given we
> are using the default disk access mode of mmap, this aligns with our
> assumption that we are paging in a single 4KB page at a time while the
> writes are coalescing flushes. We wanted to test this, so we switched a
> single node to `disk_access_mode: standard`, which should do reads based on
> the chunk sizes, and found that the read op size increased to ~7.5KB:
>
> https://imgur.com/okbfFby
>
> We don't want to sacrifice our read performance, but we also must
> scale/size our disk performance based on peak IOPS. If we could cut the
> read IOPS by a quarter or even half during compaction operations, that
> would mean a large cost savings. We are also limited on drive throughput,
> so there's a theoretical maximum op size we'd want to use to stay under
> that throughput limit. Alternatively, we could tune compaction
> throughput to maintain that limit.
>
> Has any work been done to optimize sequential I/O operations in Cassandra?
> Naively, it seems that sequential I/O operations could use a standard disk
> access mode reader with a configurable block size while normal read
> operations stick to the mmap'd segments. Being unfamiliar with the code:
> are compaction/sequential sstable reads done through a single interface,
> or do they use the same one as normal read ops?
>
> Thoughts?
>
> -Mike
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>



-- 



Stefania Alborghetti

Apache Cassandra Software Engineer

+852 6114 9265 | stefania.alborghe...@datastax.com
