Hi Mike

> Ok, thanks, that's a good starting point. This is the access point for any
> RAR and/or compaction i/o on a compressed SST?
>

In 2.2, SegmentedFile.bufferSize() does not exist; instead the buffer size is
chosen as follows (a rough sketch of this selection logic follows the list):

   -  the compression chunk length for compressed tables
   -  RandomAccessReader.DEFAULT_BUFFER_SIZE (64KB) for throttled readers
      (sequential I/O)
   -  RandomAccessReader.BUFFER_SIZE for everything else, which defaults to
      DEFAULT_BUFFER_SIZE but can be overridden via -Dcassandra.rar_buffer_size,
      see CASSANDRA-10249 <https://issues.apache.org/jira/browse/CASSANDRA-10249>.
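
To make that concrete, here is a rough sketch of the selection logic in plain
Java. It is illustrative only, not the actual 2.2 code path; the constants and
the system property are the ones listed above:

    // Rough sketch of the 2.2 buffer-size selection described above.
    // Illustrative only, not the actual Cassandra code.
    public final class BufferSizeSketch
    {
        static final int DEFAULT_BUFFER_SIZE = 64 * 1024; // 64KB

        // -Dcassandra.rar_buffer_size overrides the default (CASSANDRA-10249)
        static final int BUFFER_SIZE =
            Integer.getInteger("cassandra.rar_buffer_size", DEFAULT_BUFFER_SIZE);

        static int bufferSize(boolean compressed, int chunkLength, boolean throttled)
        {
            if (compressed)
                return chunkLength;          // compression chunk length
            if (throttled)
                return DEFAULT_BUFFER_SIZE;  // throttled (sequential) readers
            return BUFFER_SIZE;              // everything else
        }
    }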

For compressed sstables in 2.2, everything goes through the constructor of
CompressedRandomAccessReader, and as you can see there, the buffer size is the
compression chunk length, which can be set via the compression parameters (see
the example below).
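
If you want to experiment with that, the chunk length can be changed per table
through the compression options, either in cqlsh or, as in this sketch, with
the Java driver. The keyspace/table names are placeholders and the option names
are the 2.2-era ones ('sstable_compression' / 'chunk_length_kb'); existing
sstables keep their old chunk length until rewritten, e.g. by compaction or
nodetool upgradesstables:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Sketch: lower the compression chunk length (and hence the
    // CompressedRandomAccessReader buffer size) for one table.
    public class SetChunkLength
    {
        public static void main(String[] args)
        {
            try (Cluster cluster = Cluster.builder()
                                          .addContactPoint("127.0.0.1")
                                          .build();
                 Session session = cluster.connect())
            {
                session.execute(
                    "ALTER TABLE my_ks.my_table WITH compression = " +
                    "{'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 16}");
            }
        }
    }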

I should have pointed out in my previous email that in 3.0+ we also use the
compression chunk length as the buffer size for compressed tables.
CASSANDRA-5863 <https://issues.apache.org/jira/browse/CASSANDRA-5863>
introduced a chunk cache in 3.6 to speed things up.
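
For uncompressed tables in 3.0+ the buffer size instead comes out of
SegmentedFile.bufferSize(), as in the quoted explanation further down. Roughly
it behaves like the sketch below; this is illustrative only, not the actual
code, the parameter names are mine, and the exact thresholds, capping and
spinning-disk handling may differ:

    // Illustrative sketch: derive the 3.0+ buffer size for uncompressed reads
    // from an estimated record size. Not the actual
    // SegmentedFile/DiskOptimizationStrategy code.
    static int bufferSize(long recordSize, double pageCrossChance, boolean ssd)
    {
        final int PAGE = 4096;
        if (ssd)
        {
            // chance that a record starting at a random offset crosses a page boundary
            double crossProbability = (recordSize % PAGE) / (double) PAGE;
            if (crossProbability >= pageCrossChance)
                recordSize += PAGE;   // read one extra page to cover the crossing
        }
        else
        {
            recordSize += PAGE;       // spinning disks: always pad by one page
        }
        // round up to a multiple of the 4KB cache page size, capped at 64KB
        long rounded = (recordSize + PAGE - 1) & ~(PAGE - 1);
        return (int) Math.min(rounded, 64 * 1024);
    }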

Another thing to mention, which you probably know already but just in case: in
3.0+ compression is not as important as in older releases, since the new
storage engine has reduced disk usage. So, if it is a possibility at all, I
would consider trying out 3.0.x or 3.x without compression, together with some
tweaking of the CASSANDRA-8894 parameters to reduce read-ahead.


Best regards,
Stefania


On Sat, Jun 25, 2016 at 4:11 AM, Mike Heffner <m...@librato.com> wrote:

> Hi Stefania,
>
>
> > Try reading the discussions on CASSANDRA-8894
> > <https://issues.apache.org/jira/browse/CASSANDRA-8894>, plus related
> >
>
> Yes, I agree. I think the difficulty is getting the smallest possible reads
> for any RAR operation while still using an optimal read size for large
> sequential I/O operations.
>
>
>
> > tickets, and CASSANDRA-8630
> > <https://issues.apache.org/jira/browse/CASSANDRA-8630>, both delivered
> > in 3.0.
> >
> >
> We saw the same behavior, a lot of small unnecessary reads, when we logged
> read sizes in ChannelProxy.
>
>
> > This code changed significantly in 3.0, and then again more recently in
> > the latest 3.x releases, and it is still in the process of changing. I am
> > more familiar with the 3.0 changes than the recent ones, but if you are
> > interested in deep diving in the code, and if you specify which version,
> > I can be more precise and point you to the relevant code sections.
> >
>
> Unfortunately we are looking at the 2.2.6 code base right now, so it sounds
> like we may be quite far behind the newer 3.x work.
>
>
> >
> > The quick summary is that, for standard access mode, the buffer size is
> > set to 64k if there is a limiter, even if its throughput is unlimited,
> > which means that for compaction and other seq. ops that use a limiter,
> > the buffer size will be 64k, whilst for other read ops the buffer size,
> > in 3.0+, is calculated in SegmentedFile.bufferSize()
> > <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java#L226>
> >
>
> Ok, thanks, that's a good starting point. This is the access point for any
> RAR and/or compaction i/o on a compressed SST?
>
>
> > and it is based on the average record size, the disk type and two other
> > configurable parameters: the chance that a record will cross an aligned
> > cache page, and the percentile used to estimate the record size. You find
> > these 3 parameters here
> > <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/config/Config.java#L261>
> > and you may try optimizing them given that the ticket for benchmarking
> > them, CASSANDRA-10407 <https://issues.apache.org/jira/browse/CASSANDRA-10407>,
> > is still open. The buffer size will always be rounded to multiples of the
> > cache page size, 4k.
> >
> > It is currently not possible to use mmap for one type of read and std
> > for another type, as far as I know.
> >
>
> Alright, that's what I figured. It would be interesting to see what a mixed
> workload would look like.
>
> Thanks for the info!
>
> Cheers,
>
>  Mike
>
>
>
>
>
> > On Sun, Jun 19, 2016 at 6:17 PM, Mike Heffner <m...@librato.com> wrote:
> >
> > > Hi,
> > >
> > > I'm curious to know if anyone has attempted to improve read IOPS
> > > performance during sequential I/O operations (largely compactions) while
> > > still maintaining read performance for small row, random-access client
> > > reads?
> > >
> > > Our use case is a very high write load to read load ratio with rows
> > > that are small (< 1-2kb). We've taken many of the steps to ensure that
> > > client reads to random-access rows are optimal by reducing read_ahead
> > > and even using a smaller default LZ4 chunk size. So far performance has
> > > been great, with p95 read times that are < 10ms.
> > >
> > > However, we have noticed that our total read IOPS to the Cassandra data
> > > drive is extremely high compared to our write IOPS, almost 15x the write
> > > IOPS to the same drive. We even set up a ring that took the same write
> > > load with zero client reads and observed that the high read IOPS were
> > > driven by compaction operations. During large (>50GB) compactions, write
> > > and read volume (bytes) were nearly identical, which matched our
> > > assumptions, while read IOPS were 15x write IOPS.
> > >
> > > When we plotted the average read and write op size, we saw an average
> > > read op size of just under 5KB and an average write op size of 120KB.
> > > Given we are using the default disk access mode of mmap, this aligns
> > > with our assumption that we are paging in a single 4KB page at a time
> > > while the write sizes reflect coalesced write flushes. We wanted to test
> > > this, so we switched a single node to `disk_access_mode: standard`,
> > > which should do reads based on the chunk sizes, and found that the read
> > > op size increased to ~7.5KB:
> > >
> > > https://imgur.com/okbfFby
> > >
> > > We don't want to sacrifice our read performance, but we also must
> > > scale/size our disk performance based on peak IOPS. If we could cut the
> > > read IOPS by a quarter or even half during compaction operations, that
> > > would mean large cost savings. We are also limited on drive throughput,
> > > so there's a theoretical maximum op size we'd want to use to stay under
> > > that throughput limit. Alternatively, we could also tune compaction
> > > throughput to maintain that limit.
> > >
> > > Has any work been done to optimize sequential I/O operations in
> > > Cassandra? Naively, it seems that sequential I/O operations could use a
> > > standard disk access mode reader with a configurable block size while
> > > normal read operations stick to the mmap'd segments. Being unfamiliar
> > > with the code, are compaction/sequential sstable reads done through any
> > > single interface, or do they use the same path as normal read ops?
> > >
> > > Thoughts?
> > >
> > > -Mike
> > >
> > > --
> > >
> > >   Mike Heffner <m...@librato.com>
> > >   Librato, Inc.
> > >
> >
> >
> >
> > --
> >
> >
> >
> > Stefania Alborghetti
> >
> > Apache Cassandra Software Engineer
> >
> > |+852 6114 9265| stefania.alborghe...@datastax.com
> >
> >
> >
>
>
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>



-- 


Stefania Alborghetti

Apache Cassandra Committer

|+852 6114 9265| stefania.alborghe...@datastax.com


