Hi,

I ran some benchmarks on my laptop:
https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16656821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16656821
For a random read workload, varying chunk size:

Chunk size   Time
64k          25:20
64k          25:33
32k          20:01
16k          19:19
16k          19:14
8k           16:51
4k           15:39

Ariel

On Thu, Oct 18, 2018, at 2:55 PM, Ariel Weisberg wrote:
> Hi,
>
> For those who were asking about the performance impact of block size on
> compression, I wrote a microbenchmark.
>
> https://pastebin.com/RHDNLGdC
>
>      [java] Benchmark                                                 Mode  Cnt          Score          Error  Units
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k      thrpt   15  331190055.685 ±  8079758.044  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k      thrpt   15  353024925.655 ±  7980400.003  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k      thrpt   15  365664477.654 ± 10083336.038  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k       thrpt   15  305518114.172 ± 11043705.883  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k    thrpt   15  688369529.911 ± 25620873.933  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k    thrpt   15  703635848.895 ±  5296941.704  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k    thrpt   15  695537044.676 ± 17400763.731  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k     thrpt   15  727725713.128 ±  4252436.331  ops/s
>
> To summarize, compression is 8.5% slower and decompression is 1% faster.
> This is measuring only the impact on compression/decompression itself,
> not the much larger win from less often decompressing data we don't need.
>
> I didn't test decompression of Snappy and LZ4 high, but I did test
> compression.
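For anyone who wants to reproduce the shape of this measurement without the JMH harness from the pastebin, here is a rough stand-in sketch. It uses Python's stdlib zlib rather than LZ4 (lz4 is not in the standard library), so the absolute numbers will differ from the results above, but the compress/decompress-throughput-per-block-size comparison is the same idea.

```python
# Rough stand-in for the JMH microbenchmark above: measure compression and
# decompression throughput at several block sizes. zlib (deflate) stands in
# for LZ4 fast; absolute numbers differ, the shape of the comparison holds.
import time
import zlib

def throughput(block_size: int, iterations: int = 200) -> tuple[float, float]:
    """Return (compress MB/s, decompress MB/s) for one block size."""
    # Mildly compressible payload, similar in spirit to benchmarking text.
    block = (b"the quick brown fox jumps over the lazy dog " * 1500)[:block_size]
    compressed = zlib.compress(block, 1)

    start = time.perf_counter()
    for _ in range(iterations):
        zlib.compress(block, 1)
    c_rate = block_size * iterations / (time.perf_counter() - start) / 1e6

    start = time.perf_counter()
    for _ in range(iterations):
        zlib.decompress(compressed)
    d_rate = block_size * iterations / (time.perf_counter() - start) / 1e6
    return c_rate, d_rate

for size in (4096, 8192, 16384, 32768, 65536):
    c, d = throughput(size)
    print(f"{size // 1024:>2}k  compress {c:8.1f} MB/s  decompress {d:8.1f} MB/s")
```

A single-pass timing loop like this is far noisier than JMH (no warmup, no forking), so treat its output as indicative only.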
>
> Snappy:
>      [java] CompactIntegerSequenceBench.benchCompressSnappy16k  thrpt    2  196574766.116  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy32k  thrpt    2  198538643.844  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy64k  thrpt    2  194600497.613  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy8k   thrpt    2  186040175.059  ops/s
>
> LZ4 high compressor:
>      [java] CompactIntegerSequenceBench.bench16k  thrpt    2  20822947.578  ops/s
>      [java] CompactIntegerSequenceBench.bench32k  thrpt    2  12037342.253  ops/s
>      [java] CompactIntegerSequenceBench.bench64k  thrpt    2   6782534.469  ops/s
>      [java] CompactIntegerSequenceBench.bench8k   thrpt    2  32254619.594  ops/s
>
> LZ4 high is the one instance where block size mattered a lot. It's a bit
> suspicious, really, when you look at the ratio of performance to block
> size being close to 1:1. I couldn't spot a bug in the benchmark, though.
>
> Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
>
> Chunk size 8192, ratio 0.709473
> Chunk size 16384, ratio 0.667236
> Chunk size 32768, ratio 0.634735
> Chunk size 65536, ratio 0.607208
>
> By way of comparison I also ran deflate with maximum compression:
>
> Chunk size 8192, ratio 0.426434
> Chunk size 16384, ratio 0.402423
> Chunk size 32768, ratio 0.381627
> Chunk size 65536, ratio 0.364865
>
> Ariel
>
> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
> > FWIW, I’m not -0, just think that long after the freeze date a change
> > like this needs a strong mandate from the community. I think the change
> > is a good one.
> >
> > > On 17 Oct 2018, at 22:09, Ariel Weisberg <ar...@weisberg.ws> wrote:
> > >
> > > Hi,
> > >
> > > It's really not appreciably slower compared to the decompression we are
> > > going to do, which is going to take several microseconds.
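The trend in the Alice in Wonderland numbers above (larger chunks compress better, because the compressor sees more context and amortizes per-chunk overhead) is easy to reproduce. A sketch, again with stdlib zlib standing in for LZ4 fast, on a repetitive stand-in corpus:

```python
# Demonstrate that larger chunks yield smaller (better) compression ratios,
# as in the Alice in Wonderland numbers above. zlib stands in for LZ4 fast;
# the absolute ratios differ but the trend is the same.
import zlib

# Repetitive English-like text as a stand-in corpus (~214 KiB).
text = (b"Alice was beginning to get very tired of sitting by her sister "
        b"on the bank, and of having nothing to do. ") * 2048

def chunked_ratio(data: bytes, chunk_size: int) -> float:
    """Compress data chunk-by-chunk; return compressed size / uncompressed size."""
    compressed = sum(
        len(zlib.compress(data[i:i + chunk_size], 6))
        for i in range(0, len(data), chunk_size)
    )
    return compressed / len(data)

for size in (8192, 16384, 32768, 65536):
    print(f"Chunk size {size}, ratio {chunked_ratio(text, size):.6f}")

# Bigger chunks never do worse: fewer stream headers, longer match windows.
assert chunked_ratio(text, 65536) <= chunked_ratio(text, 8192)
```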
> > > Decompression is also going to be faster because we are going to do
> > > less unnecessary decompression, and the decompression itself may be
> > > faster since it may fit in a higher-level cache better. I ran a
> > > microbenchmark comparing them.
> > >
> > > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
> > >
> > > Fetching a long from memory: 56 nanoseconds
> > > Compact integer sequence:    80 nanoseconds
> > > Summing integer sequence:    165 nanoseconds
> > >
> > > Currently we have one +1 from Kurt to change the representation and
> > > possibly a -0 from Benedict. That's not really enough to make an
> > > exception to the code freeze. If you want it to happen (or not), you
> > > need to speak up; otherwise only the default will change.
> > >
> > > Regards,
> > > Ariel
> > >
> > > On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
> > >> I think if we're going to drop it to 16k, we should invest in the compact
> > >> sequencing as well. Just lowering it to 16k will potentially have a
> > >> painful impact on anyone running low-memory nodes, but if we can do it
> > >> without the memory impact I don't think there's any reason to wait
> > >> another major version to implement it.
> > >>
> > >> Having said that, we should probably benchmark the two representations
> > >> Ariel has come up with.
> > >>
> > >> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> I would guess a lot of C* clusters/tables have this option set to the
> > >>> default value, and not many of them need to read such big chunks of
> > >>> data. I believe this will greatly limit disk overreads for a fair
> > >>> amount (a big majority?) of new users. It seems fair enough to change
> > >>> this default value, and I also think 4.0 is a nice place to do this.
> > >>>
> > >>> Thanks for taking care of this, Ariel, and for making sure there is a
> > >>> consensus here as well.
> > >>>
> > >>> C*heers,
> > >>> -----------------------
> > >>> Alain Rodriguez - al...@thelastpickle.com
> > >>> France / Spain
> > >>>
> > >>> The Last Pickle - Apache Cassandra Consulting
> > >>> http://www.thelastpickle.com
> > >>>
> > >>> On Sat, Oct 13, 2018 at 08:52, Ariel Weisberg <ar...@weisberg.ws> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> This would only impact new tables; existing tables would get their
> > >>>> chunk_length_in_kb from the existing schema. It's something we record
> > >>>> in a system table.
> > >>>>
> > >>>> I have an implementation of a compact integer sequence that only
> > >>>> requires 37% of the memory required today. So we could do this with
> > >>>> only slightly more than doubling the memory used. I'll post that to
> > >>>> the JIRA soon.
> > >>>>
> > >>>> Ariel
> > >>>>
> > >>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
> > >>>>>
> > >>>>> I think 16k is a better default, but it should only affect new tables.
> > >>>>> Whoever changes it, please make sure you think about the upgrade path.
> > >>>>>
> > >>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <b...@instaclustr.com> wrote:
> > >>>>>>
> > >>>>>> This is something that's bugged me for ages; tbh the performance gain
> > >>>>>> for most use cases far outweighs the increase in memory usage, and I
> > >>>>>> would even be in favor of changing the default now and optimizing the
> > >>>>>> storage cost later (if it's found to be worth it).
> > >>>>>>
> > >>>>>> For some anecdotal evidence: 4kb is usually what we end up setting it
> > >>>>>> to. 16kb feels more reasonable given the memory impact, but what
> > >>>>>> would be the point if, practically, most folks set it to 4kb anyway?
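Ariel's actual compact integer sequence implementation is on the JIRA ticket; as a purely hypothetical illustration of why a large saving is plausible, note that chunk offsets are monotonically increasing, so a group can share one 8-byte base and store small fixed-width deltas instead of a full long per offset. All names below are invented for the sketch.

```python
# Hypothetical illustration of a compact monotone integer sequence: store
# one u64 base per group of offsets plus a u32 delta per offset. This is
# NOT the implementation posted to the JIRA; it only shows the general idea.
import struct

GROUP = 16  # offsets per group; one 8-byte base amortized over the group

def pack(offsets: list[int]) -> bytes:
    """Pack sorted offsets as (base: u64, deltas: u32[<=GROUP]) groups."""
    out = bytearray()
    for g in range(0, len(offsets), GROUP):
        group = offsets[g:g + GROUP]
        base = group[0]
        out += struct.pack("<Q", base)
        for off in group:
            out += struct.pack("<I", off - base)  # assumes delta fits in 32 bits
    return bytes(out)

def get(packed: bytes, i: int) -> int:
    """O(1) random access: locate the group's base, add the 32-bit delta."""
    group_bytes = 8 + 4 * GROUP
    g, j = divmod(i, GROUP)
    base = struct.unpack_from("<Q", packed, g * group_bytes)[0]
    delta = struct.unpack_from("<I", packed, g * group_bytes + 8 + 4 * j)[0]
    return base + delta

offsets = list(range(0, 64 * 1024 * 100, 64 * 1024))  # 100 chunk offsets
packed = pack(offsets)
assert all(get(packed, i) == off for i, off in enumerate(offsets))
# Plain layout: 8 bytes/offset. Here: (8 + 4*GROUP)/GROUP = 4.5 bytes/offset.
print(len(packed), "vs", 8 * len(offsets))
```

The layout here lands at 4.5 bytes per offset rather than the 37% figure quoted above; tighter delta widths or larger groups change that trade-off, which is exactly what a real implementation would tune.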
> > >>>>>>
> > >>>>>> Note that chunk_length will largely depend on your read sizes, but
> > >>>>>> 4k is the floor for most physical devices in terms of their block
> > >>>>>> size.
> > >>>>>>
> > >>>>>> +1 for making this change in 4.0, given the small size and the large
> > >>>>>> improvement to new users' experience (as long as we are explicit in
> > >>>>>> the documentation about memory consumption).
> > >>>>>>
> > >>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> > >>>>>>>
> > >>>>>>> This ticket has languished for a while. IMO it's too late in 4.0 to
> > >>>>>>> implement a more memory-efficient representation for compressed
> > >>>>>>> chunk offsets. However, I don't think we should put out another
> > >>>>>>> release with the current 64k default, as it's pretty unreasonable.
> > >>>>>>>
> > >>>>>>> I propose that we lower the value to 16kb. 4k might never be the
> > >>>>>>> correct default anyway, as there is a cost to compression, and 16k
> > >>>>>>> will still be a large improvement.
> > >>>>>>>
> > >>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0.
> > >>>>>>> In the past there has been some consensus about reducing this value,
> > >>>>>>> although maybe with more memory efficiency.
> > >>>>>>>
> > >>>>>>> The napkin math for what this costs is:
> > >>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
> > >>>>>>> chunks at 8 bytes each (128MB).
> > >>>>>>> With 16k chunks, that's 512MB.
> > >>>>>>> With 4k chunks, it's 2G.
> > >>>>>>> Per terabyte of data (pre-compression)."
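The napkin math above checks out; a quick sanity check of the arithmetic:

```python
# Verify the napkin math: offset-map memory per 1 TB of uncompressed data,
# at 8 bytes per chunk offset.
TB = 1024 ** 4
for chunk_kb in (64, 16, 4):
    chunks = TB // (chunk_kb * 1024)
    mem_bytes = chunks * 8
    print(f"{chunk_kb:>2}k chunks: {chunks // 2**20}M chunks, "
          f"{mem_bytes // 2**20} MiB of offsets")
# → 64k: 128 MiB, 16k: 512 MiB, 4k: 2048 MiB, matching the figures quoted.
```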
> > >>>>>>>
> > >>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> > >>>>>>>
> > >>>>>>> By way of comparison, memory mapping the files has a similar cost
> > >>>>>>> of 8 bytes per 4k page. Multiple mappings make this more expensive.
> > >>>>>>> With a default of 16kb this would be 4x less expensive than memory
> > >>>>>>> mapping a file. I only mention this to give a sense of the costs we
> > >>>>>>> are already paying; I am not saying they are directly related.
> > >>>>>>>
> > >>>>>>> I'll wait a week for discussion and, if there is consensus, make
> > >>>>>>> the change.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Ariel
> > >>>>>>>
> > >>>>>>> ---------------------------------------------------------------------
> > >>>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >>>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >>>>>>
> > >>>>>> --
> > >>>>>> Ben Bromhead
> > >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> > >>>>>> +1 650 284 9692
> > >>>>>> Reliability at Scale
> > >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer