The doc does say this: "A log-structured engine that avoids overwrites and uses sequential IO to update data is essential for writing to solid-state disks (SSD) and hard disks (HDD). On HDD, writing randomly involves a higher number of seek operations than sequential writing. The seek penalty incurred can be substantial. Using sequential IO (thereby avoiding write amplification <http://en.wikipedia.org/wiki/Write_amplification> and disk failure), Cassandra accommodates inexpensive, consumer SSDs extremely well."

I presume that write amplification argues for placing the commit log on a separate SSD device. That should probably be mentioned.

-- Jack Krupansky
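As a concrete illustration of that placement, a minimal cassandra.yaml sketch; commitlog_directory and data_file_directories are the relevant settings, but the mount points here are hypothetical:

    # cassandra.yaml -- illustrative excerpt; adjust paths to your own mounts.
    # Keeping the commit log's sequential append stream on its own device
    # keeps it from contending with SSTable flush and compaction IO.
    commitlog_directory: /mnt/ssd0/cassandra/commitlog   # hypothetical mount
    data_file_directories:
        - /mnt/ssd1/cassandra/data                       # hypothetical mount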
On Thu, Mar 10, 2016 at 12:52 PM, Matt Kennedy <matt.kenn...@datastax.com> wrote:

It isn't really the data written by the host that you're concerned with, it's the data written by your application. I'd start by instrumenting your application tier to tally up the size of the values that it writes to C*.

However, that value may not be extremely useful on its own; you can't do much with the information it provides. It is probably a better idea to track the bytes written to flash for each drive, so that you know the physical endurance of that type of drive given your workload. Unfortunately, the TBW endurance rating for the drive may not be very useful either, given the difference between the synthetic workload used to create those ratings and the workload that Cassandra produces in your particular case. You can find out more about those ratings here:
https://www.jedec.org/standards-documents/docs/jesd219a

Matt Kennedy
Sr. Product Manager, DSE Core
matt.kenn...@datastax.com | Public Calendar <http://goo.gl/4Ui04Z>
DataStax Enterprise - the database for cloud applications.

On Thu, Mar 10, 2016 at 11:44 AM, Dikang Gu <dikan...@gmail.com> wrote:

Hi Matt,

Thanks for the detailed explanation! Yes, this is exactly what I'm looking for: "write amplification = data written to flash / data written by the host".

We are heavily using LCS in production, so I'd like to figure out the amplification caused by that and see what we can do to optimize it. I have the metrics for "data written to flash", and I'm wondering whether there is an easy way to get the "data written by the host" on each C* node.

Thanks

On Thu, Mar 10, 2016 at 8:48 AM, Matt Kennedy <mkenn...@datastax.com> wrote:

TL;DR - Cassandra actually causes a ton of write amplification, but it doesn't freaking matter any more. Read on for details...

That slide deck does have a lot of very good information on it, but unfortunately I think it has led to a fundamental misunderstanding about Cassandra and write amplification. In particular, slide 51 vastly oversimplifies the situation.

The Wikipedia definition of write amplification looks at this from the perspective of the SSD controller:
https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value

In short: write amplification = data written to flash / data written by the host

So, if I write 1MB in my application, but the SSD has to write my 1MB plus rearrange another 1MB of data in order to make room for it, then the drive has written a total of 2MB and my write amplification is 2x.

In other words, it is measuring how much extra the SSD controller has to write in order to do its own housekeeping.
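That arithmetic, as a toy Python sketch (the byte counts are made up for illustration):

    # Controller-level write amplification: flash writes / host writes.
    def write_amplification(flash_bytes: float, host_bytes: float) -> float:
        return flash_bytes / host_bytes

    # 1 MB written by the application, plus 1 MB the controller had to
    # relocate to make room for it:
    print(write_amplification(flash_bytes=2 * 1024**2,
                              host_bytes=1 * 1024**2))  # -> 2.0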
However, the Wikipedia definition is a bit more constrained than how the term is used in the storage industry. The whole point of looking at write amplification is to understand the impact that a particular workload is going to have on the underlying NAND by virtue of the data written. So a definition of write amplification that is a little more relevant to the context of Cassandra is this:

write amplification = data written to flash / data written to the database

So, while the fact that we only sequentially write large immutable SSTables does in fact mean that controller-level write amplification is near zero, compaction comes along and completely destroys that tidy little story. Think about it: every time a compaction re-writes data that has already been written, we are creating a lot of application-level write amplification. The compaction strategy and the workload itself determine the real application-level write amp, but generally speaking, LCS is the worst, followed by STCS, and DTCS will cause the least write amp.

To measure this, you can usually use smartctl (it may be another mechanism depending on the SSD manufacturer) to get the physical bytes written to your SSDs, and divide that by the data that you've actually logically written to Cassandra. I've measured (more than two years ago) LCS write amp as high as 50x on some workloads, which is significantly higher than the typical controller-level write amp on a b-tree style update-in-place data store. Also note that the new storage engine in general reduces a lot of inefficiency in the Cassandra storage engine, therefore reducing the impact of write amp due to compactions.

However, if you're a person who understands SSDs, at this point you're wondering why we aren't burning out SSDs right and left. The reality is that general SSD endurance has gotten so good that all this write amp isn't really a problem any more. If you're curious to read more about that, I recommend you start here:

http://hothardware.com/news/google-data-center-ssd-research-report-offers-surprising-results-slc-not-more-reliable-than-mlc-flash

and the paper that article mentions:

http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf

Hope this helps.

Matt Kennedy
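A rough Python sketch of that smartctl measurement, assuming a drive that reports SMART attribute 241 (Total_LBAs_Written) in 512-byte units; both assumptions vary by manufacturer, so treat this purely as an illustration:

    import re
    import subprocess

    def flash_bytes_written(device: str) -> int:
        """Return total bytes written to flash, per SMART attribute 241
        (Total_LBAs_Written), assuming 512-byte units. Typically needs root."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=True).stdout
        m = re.search(r"Total_LBAs_Written.*?(\d+)\s*$", out, re.MULTILINE)
        if m is None:
            raise RuntimeError(f"{device} does not expose Total_LBAs_Written")
        return int(m.group(1)) * 512

    # Sample before and after a test window, then divide the flash delta by
    # the bytes you logically wrote to Cassandra over the same window:
    #   write_amp = (flash_after - flash_before) / logical_bytes_written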
On Thu, Mar 10, 2016 at 7:05 AM, Paulo Motta <pauloricard...@gmail.com> wrote:

This is a good source on Cassandra + write amplification:
http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives

2016-03-10 9:57 GMT-03:00 Benjamin Lerer <benjamin.le...@datastax.com>:

Cassandra should not cause any write amplification. Write amplification happens only when you update data on SSDs. Cassandra does not update any data in place. Data can be rewritten during compaction, but it is never updated.

Benjamin

On Thu, Mar 10, 2016 at 12:42 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

Hi Dikang,

I am not sure what you call "amplification", but as sizes highly depend on the data structure, I would give it a try using CCM (https://github.com/pcmanus/ccm) or some test cluster with 'production like' settings and schema. You can write a row, flush it and see how big the data is cluster-wide / per node.

Hope this will be of some help.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-10 7:18 GMT+01:00 Dikang Gu <dikan...@gmail.com>:

Hello there,

I'm wondering whether there is a good way to measure the write amplification of Cassandra.

I'm thinking it could be calculated as (number of bytes written to the disk) / (size of mutations written to the node).

Do we already have a metric for the "size of mutations written to the node"? I did not find one in the JMX metrics.

Thanks

--
Dikang
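For the denominator of that ratio, Matt's suggestion of instrumenting the application tier might look something like this toy sketch using the DataStax Python driver; the keyspace, table and tally logic are all hypothetical:

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])    # hypothetical contact point
    session = cluster.connect("my_ks")  # hypothetical keyspace

    logical_bytes_written = 0  # running tally of bytes the app sends to C*

    def tracked_insert(key: str, value: bytes) -> None:
        """Write one row and tally its logical payload size."""
        global logical_bytes_written
        session.execute("INSERT INTO my_table (k, v) VALUES (%s, %s)",
                        (key, value))
        logical_bytes_written += len(key.encode("utf-8")) + len(value)

Divide the smartctl flash delta by this tally over the same window to approximate the application-level write amplification Matt describes.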