Hi,

For those who were asking about the performance impact of block size on compression, I wrote a microbenchmark.
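The core of the harness looks roughly like this (a minimal sketch assuming the lz4-java and JMH APIs; the class and field names here are illustrative, and the actual code in the pastebin below differs in its details):

import java.util.concurrent.ThreadLocalRandom;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ChunkSizeBench {
    // Chunk sizes under test: 8k, 16k, 32k, 64k.
    @Param({"8192", "16384", "32768", "65536"})
    public int chunkSize;

    LZ4Compressor compressor;
    byte[] src;
    byte[] dest;

    @Setup
    public void setup() {
        compressor = LZ4Factory.fastestInstance().fastCompressor();
        src = new byte[chunkSize];
        ThreadLocalRandom.current().nextBytes(src); // stand-in for real input
        dest = new byte[compressor.maxCompressedLength(chunkSize)];
    }

    // Returning the compressed length keeps the JIT from eliding the call.
    @Benchmark
    public int compress() {
        return compressor.compress(src, 0, chunkSize, dest, 0, dest.length);
    }
}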
https://pastebin.com/RHDNLGdC

[java] Benchmark                                                Mode  Cnt          Score          Error  Units
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k      thrpt   15  305518114.172 ± 11043705.883  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k     thrpt   15  331190055.685 ±  8079758.044  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k     thrpt   15  353024925.655 ±  7980400.003  ops/s
[java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k     thrpt   15  365664477.654 ± 10083336.038  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k    thrpt   15  727725713.128 ±  4252436.331  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k   thrpt   15  688369529.911 ± 25620873.933  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k   thrpt   15  703635848.895 ±  5296941.704  ops/s
[java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k   thrpt   15  695537044.676 ± 17400763.731  ops/s

To summarize: compression is 8.5% slower and decompression is 1% faster. This measures only the cost of compression/decompression itself, not the much larger win from less often decompressing data we don't need.

I didn't test decompression with Snappy or LZ4 high, but I did test compression.

Snappy:
[java] CompactIntegerSequenceBench.benchCompressSnappy8k    thrpt    2  186040175.059  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy16k   thrpt    2  196574766.116  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy32k   thrpt    2  198538643.844  ops/s
[java] CompactIntegerSequenceBench.benchCompressSnappy64k   thrpt    2  194600497.613  ops/s

LZ4 high compressor:
[java] CompactIntegerSequenceBench.bench8k    thrpt    2  32254619.594  ops/s
[java] CompactIntegerSequenceBench.bench16k   thrpt    2  20822947.578  ops/s
[java] CompactIntegerSequenceBench.bench32k   thrpt    2  12037342.253  ops/s
[java] CompactIntegerSequenceBench.bench64k   thrpt    2   6782534.469  ops/s

LZ4 high is the one case where block size mattered a lot. It's a bit suspicious that throughput scales almost exactly 1:1 with block size, but I couldn't spot a bug in the benchmark.

Compression ratios with LZ4 fast for the text of Alice in Wonderland were:

Chunk size 8192, ratio 0.709473
Chunk size 16384, ratio 0.667236
Chunk size 32768, ratio 0.634735
Chunk size 65536, ratio 0.607208

By way of comparison, I also ran deflate with maximum compression:

Chunk size 8192, ratio 0.426434
Chunk size 16384, ratio 0.402423
Chunk size 32768, ratio 0.381627
Chunk size 65536, ratio 0.364865
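For the ratio numbers, the measurement amounts to compressing the file in fixed-size chunks and dividing total compressed size by input size. A sketch, again assuming lz4-java (the class name and file argument are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public class ChunkRatio {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0])); // e.g. alice.txt
        LZ4Compressor c = LZ4Factory.fastestInstance().fastCompressor();
        for (int chunk : new int[] { 8192, 16384, 32768, 65536 }) {
            byte[] dest = new byte[c.maxCompressedLength(chunk)];
            long compressed = 0;
            // Compress each chunk independently, as the sstable format does.
            for (int off = 0; off < data.length; off += chunk) {
                int len = Math.min(chunk, data.length - off);
                compressed += c.compress(data, off, len, dest, 0, dest.length);
            }
            System.out.printf("Chunk size %d, ratio %f%n",
                              chunk, (double) compressed / data.length);
        }
    }
}

Smaller chunks compress worse because each chunk starts with an empty match window, which is why the ratio improves steadily with chunk size.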
Ariel

On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
> FWIW, I’m not -0, just think that long after the freeze date a change
> like this needs a strong mandate from the community. I think the change
> is a good one.
>
> On 17 Oct 2018, at 22:09, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >
> > Hi,
> >
> > It's really not appreciably slower compared to the decompression we are
> > going to do, which is going to take several microseconds. Decompression
> > is also going to be faster because we are going to do less unnecessary
> > decompression, and the decompression itself may be faster since it may
> > fit in a higher level cache better. I ran a microbenchmark comparing them.
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
> >
> > Fetching a long from memory:  56 nanoseconds
> > Compact integer sequence:     80 nanoseconds
> > Summing integer sequence:    165 nanoseconds
> >
> > Currently we have one +1 from Kurt to change the representation and
> > possibly a -0 from Benedict. That's not really enough to make an
> > exception to the code freeze. If you want it to happen (or not) you
> > need to speak up, otherwise only the default will change.
> >
> > Regards,
> > Ariel
> >
> > On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
> >> I think if we're going to drop it to 16k, we should invest in the
> >> compact sequencing as well. Just lowering it to 16k will have a
> >> potentially painful impact on anyone running low memory nodes, but if
> >> we can do it without the memory impact I don't think there's any
> >> reason to wait another major version to implement it.
> >>
> >> Having said that, we should probably benchmark the two representations
> >> Ariel has come up with.
> >>
> >> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> >>
> >>> +1
> >>>
> >>> I would guess a lot of C* clusters/tables have this option set to the
> >>> default value, and not many of them need to read such big chunks of
> >>> data. I believe this will greatly limit disk overreads for a fair
> >>> amount (a big majority?) of new users. It seems fair enough to change
> >>> this default value, and I also think 4.0 is a nice place to do this.
> >>>
> >>> Thanks for taking care of this Ariel, and for making sure there is a
> >>> consensus here as well.
> >>>
> >>> C*heers,
> >>> -----------------------
> >>> Alain Rodriguez - al...@thelastpickle.com
> >>> France / Spain
> >>>
> >>> The Last Pickle - Apache Cassandra Consulting
> >>> http://www.thelastpickle.com
> >>>
> >>> On Sat, Oct 13, 2018 at 08:52, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> This would only impact new tables; existing tables would get their
> >>>> chunk_length_in_kb from the existing schema. It's something we record
> >>>> in a system table.
> >>>>
> >>>> I have an implementation of a compact integer sequence that only
> >>>> requires 37% of the memory required today. So we could do this with
> >>>> only slightly more than doubling the memory used. I'll post that to
> >>>> the JIRA soon.
> >>>>
> >>>> Ariel
> >>>>
> >>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
> >>>>>
> >>>>> I think 16k is a better default, but it should only affect new tables.
> >>>>> Whoever changes it, please make sure you think about the upgrade path.
> >>>>>
> >>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <b...@instaclustr.com> wrote:
> >>>>>>
> >>>>>> This is something that's bugged me for ages; tbh the performance
> >>>>>> gain for most use cases far outweighs the increase in memory usage,
> >>>>>> and I would even be in favor of changing the default now and
> >>>>>> optimizing the storage cost later (if it's found to be worth it).
> >>>>>>
> >>>>>> For some anecdotal evidence: 4kb is usually what we end up setting
> >>>>>> it to. 16kb feels more reasonable given the memory impact, but what
> >>>>>> would be the point if, practically, most folks set it to 4kb anyway?
> >>>>>>
> >>>>>> Note that chunk_length will largely be dependent on your read
> >>>>>> sizes, but 4k is the floor for most physical devices in terms of
> >>>>>> block size.
> >>>>>>
> >>>>>> +1 for making this change in 4.0 given the small size and the large
> >>>>>> improvement to new users' experience (as long as we are explicit in
> >>>>>> the documentation about memory consumption).
> >>>>>>
> >>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> >>>>>>>
> >>>>>>> This ticket has languished for a while. IMO it's too late in 4.0
> >>>>>>> to implement a more memory-efficient representation for compressed
> >>>>>>> chunk offsets. However, I don't think we should put out another
> >>>>>>> release with the current 64k default, as it's pretty unreasonable.
> >>>>>>>
> >>>>>>> I propose that we lower the value to 16kb. 4k might never be the
> >>>>>>> correct default anyway, as there is a cost to compression, and 16k
> >>>>>>> will still be a large improvement.
> >>>>>>>
> >>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0.
> >>>>>>> In the past there has been some consensus about reducing this
> >>>>>>> value, although maybe with more memory efficiency.
> >>>>>>>
> >>>>>>> The napkin math for what this costs is:
> >>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
> >>>>>>> chunks at 8 bytes each (128MB).
> >>>>>>> With 16k chunks, that's 512MB.
> >>>>>>> With 4k chunks, it's 2G.
> >>>>>>> Per terabyte of data (pre-compression)."
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> >>>>>>>
> >>>>>>> By way of comparison, memory mapping the files has a similar cost
> >>>>>>> of 8 bytes per 4k page, and multiple mappings make this more
> >>>>>>> expensive. With a default of 16kb this would be 4x less expensive
> >>>>>>> than memory mapping a file. I only mention this to give a sense of
> >>>>>>> the costs we are already paying; I am not saying they are directly
> >>>>>>> related.
> >>>>>>>
> >>>>>>> I'll wait a week for discussion and, if there is consensus, make
> >>>>>>> the change.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Ariel
> >>>>>>
> >>>>>> --
> >>>>>> Ben Bromhead
> >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> >>>>>> +1 650 284 9692
> >>>>>> Reliability at Scale
> >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org