Hi Stefania,
> Try reading the discussions on CASSANDRA-8894
> <https://issues.apache.org/jira/browse/CASSANDRA-8894>, plus related
> tickets, and CASSANDRA-8630
> <https://issues.apache.org/jira/browse/CASSANDRA-8630>, both delivered in
> 3.0.

Yes, I agree. I think the difficulty is getting the smallest possible reads
for any RAR operation while still using an optimal, larger read size for
big sequential I/O operations.

We saw the same behavior of a lot of small unnecessary reads when we logged
read sizes in ChannelProxy.

> This code changed significantly in 3.0, and then again more recently in the
> latest 3.x releases, and it is still in the process of changing. I am more
> familiar with the 3.0 changes than the recent ones, but if you are
> interested in deep diving in the code, and if you specify which version, I
> can be more precise and point you to the relevant code sections.

Unfortunately we are looking at the 2.2.6 code base right now, so it sounds
like we may be off quite a bit from the newer 3.x work.

> The quick summary is that, for standard access mode, the buffer size is set
> to 64k if there is a limiter, even if its throughput is unlimited, which
> means that for compaction and other seq. ops that use a limiter, the buffer
> size will be 64k, whilst for other read ops the buffer size, in 3.0+, is
> calculated in SegmentedFile.bufferSize()
> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java#L226>

Ok, thanks, that's a good starting point. Is this the access point for any
RAR and/or compaction I/O on a compressed SSTable?

> and it is based on the average record size, the disk type and two other
> configurable parameters: the chance that a record will cross an aligned
> cache page, and the percentile used to estimate the record size. You find
> these 3 parameters here
> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/config/Config.java#L261>
> and you may try optimizing them, given that the ticket for benchmarking
> them, CASSANDRA-10407 <https://issues.apache.org/jira/browse/CASSANDRA-10407>,
> is still open. The buffer size will always be rounded to multiples of the
> cache page size, 4k.
>
> It is currently not possible to use mmap for one type of reads and std for
> another type, as far as I know.

Alright, that's what I figured. It would be interesting to see what a mixed
setup would look like for our workload.
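Just to make sure I'm reading the bufferSize() heuristic you describe
correctly, here's a rough back-of-the-envelope sketch of my understanding.
To be clear, this is only my own approximation for discussion, not the
actual SegmentedFile/Config code, and the knob names and defaults below are
placeholders loosely modeled on the parameters you described:

    // Sketch of the buffer sizing heuristic as I understand it (NOT the
    // real Cassandra code; names and defaults are placeholders).
    public class BufferSizeSketch
    {
        static final int PAGE_SIZE = 4096;                 // cache page size, 4k
        static final int RATE_LIMITED_BUFFER = 64 * 1024;  // 64k when a limiter is present

        // Placeholder knobs standing in for the percentile used to estimate
        // the record size and the acceptable chance of crossing an aligned page.
        static final double RECORD_SIZE_PERCENTILE = 0.95;
        static final double PAGE_CROSS_CHANCE = 0.1;

        // recordSize: record size estimated at RECORD_SIZE_PERCENTILE from
        // sstable stats (assumed to be computed by the caller).
        static int bufferSize(long recordSize, boolean rateLimited)
        {
            if (rateLimited)
                return RATE_LIMITED_BUFFER; // compaction and other seq. ops with a limiter

            // Probability that a record of this size straddles an aligned 4k
            // page, assuming a uniformly distributed start offset within a page.
            double crossChance = (recordSize % PAGE_SIZE) / (double) PAGE_SIZE;

            // If crossing is likely enough, budget one extra page for the read.
            if (crossChance > PAGE_CROSS_CHANCE)
                recordSize += PAGE_SIZE;

            // Always round up to a multiple of the 4k cache page size.
            long pages = (recordSize + PAGE_SIZE - 1) / PAGE_SIZE;
            return (int) (pages * PAGE_SIZE);
        }

        public static void main(String[] args)
        {
            // Our rows are < 2k, so a random read would use an 8k buffer here
            // (a 2k record has a ~50% chance of straddling a page boundary).
            System.out.println(bufferSize(2048, false));        // 8192
            // Rate-limited sequential reads (compaction) would use 64k buffers.
            System.out.println(bufferSize(2048, true));         // 65536

            // Back-of-the-envelope IOPS at a fixed 100 MB/s of compaction reads:
            long throughput = 100L * 1024 * 1024;
            System.out.println(throughput / PAGE_SIZE);            // ~25,600 ops/s at 4k (mmap page faults)
            System.out.println(throughput / RATE_LIMITED_BUFFER);  // ~1,600 ops/s at 64k buffered reads
        }
    }

If that's roughly the idea, then moving compaction reads from 4k page
faults to 64k buffered reads would cut read ops by an order of magnitude at
the same throughput, which is exactly the kind of savings we're after.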
Thanks for the info!

Cheers,

Mike

> On Sun, Jun 19, 2016 at 6:17 PM, Mike Heffner <m...@librato.com> wrote:
>
> > Hi,
> >
> > I'm curious to know if anyone has attempted to improve read IOPS
> > performance during sequential I/O operations (largely compactions) while
> > still maintaining read performance for small-row, random-access client
> > reads?
> >
> > Our use case has a very high write-to-read load ratio with rows that are
> > small (< 1-2kb). We've taken a number of steps to ensure that
> > random-access client reads are optimal, reducing read_ahead and even
> > using a smaller-than-default LZ4 chunk size. So far performance has been
> > great, with p95 read times that are < 10ms.
> >
> > However, we have noticed that our total read IOPS to the Cassandra data
> > drive is extremely high compared to our write IOPS, almost 15x the write
> > IOPS to the same drive. We even set up a ring that took the same write
> > load with zero client reads and observed that the high read IOPS were
> > driven by compaction operations. During large (>50GB) compactions, write
> > and read volume (bytes) were nearly identical, which matched our
> > assumptions, while read IOPS were 15x write IOPS.
> >
> > When we plotted the average read and write op size, we saw an average
> > read op size of just under 5KB and an average write op size of 120KB.
> > Given we are using the default disk access mode of mmap, this aligns
> > with our assumption that we are paging in a single 4KB page at a time,
> > while the write sizes reflect coalesced flushes. We wanted to test this,
> > so we switched a single node to `disk_access_mode: standard`, which
> > should do reads based on the chunk sizes, and found that the read op
> > size increased to ~7.5KB:
> >
> > https://imgur.com/okbfFby
> >
> > We don't want to sacrifice our read performance, but we must also
> > scale/size our disk performance based on peak IOPS. If we could cut the
> > read IOPS by a quarter or even half during compaction operations, that
> > would mean a large cost savings. We are also limited on drive
> > throughput, so there's a theoretical maximum op size we'd want to use to
> > stay under that throughput limit. Alternatively, we could tune
> > compaction throughput to stay under that limit.
> >
> > Has any work been done to optimize sequential I/O operations in
> > Cassandra? Naively it seems that sequential I/O operations could use a
> > standard disk access mode reader with a configurable block size while
> > normal read operations stick to the mmap'd segments. Being unfamiliar
> > with the code, I'm not sure: are compaction/sequential sstable reads
> > done through a single interface, or do they use the same path as normal
> > read ops?
> >
> > Thoughts?
> >
> > -Mike
> >
> > --
> >
> > Mike Heffner <m...@librato.com>
> > Librato, Inc.
>
> --
>
> Stefania Alborghetti
> Apache Cassandra Software Engineer, DataStax <http://www.datastax.com/>
> +852 6114 9265 | stefania.alborghe...@datastax.com

--

Mike Heffner <m...@librato.com>
Librato, Inc.