Agree with Sylvain (and I think Benedict) - there’s no compelling reason to violate the freeze here. We’ve had the wrong default for years - add a note to the docs that we’ll be changing it in the future, but let’s not violate the freeze now.
-- Jeff Jirsa

> On Oct 19, 2018, at 10:06 AM, Sylvain Lebresne <lebre...@gmail.com> wrote:
>
> Fwiw, as much as I agree this is a change worth doing in general, I am
> -0 for 4.0, on both the "compact sequencing" and the change of default.
> We're closing in on 2 months into the freeze, and for me a freeze does
> include not changing defaults, because changing a default ideally implies
> a decent amount of analysis/benchmarking of the consequences of that
> change[1], and that doesn't fit my definition of a freeze.
>
> [1]: to be extra clear, I'm not saying we've always done this, far from it.
> But I hope we can all agree we were wrong not to do it when we didn't, and
> should strive to improve, not repeat past mistakes.
> --
> Sylvain
>
>> On Thu, Oct 18, 2018 at 8:55 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> For those who were asking about the performance impact of block size on
>> compression, I wrote a microbenchmark.
>>
>> https://pastebin.com/RHDNLGdC
>>
>> [java] Benchmark                                               Mode  Cnt          Score          Error  Units
>> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
>> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
>> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
>> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
>> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
>>
>> To summarize, compression is 8.5% slower and decompression is 1% faster.
>> This measures only the cost of compression/decompression itself, not the
>> much larger win from no longer decompressing data we don't need.
>>
>> I didn't test decompression for Snappy and LZ4 high, but I did test
>> compression.
>>
>> Snappy:
>> [java] CompactIntegerSequenceBench.benchCompressSnappy8k   thrpt    2  186040175.059  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressSnappy16k  thrpt    2  196574766.116  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressSnappy32k  thrpt    2  198538643.844  ops/s
>> [java] CompactIntegerSequenceBench.benchCompressSnappy64k  thrpt    2  194600497.613  ops/s
>>
>> LZ4 high compressor:
>> [java] CompactIntegerSequenceBench.bench8k   thrpt    2  32254619.594  ops/s
>> [java] CompactIntegerSequenceBench.bench16k  thrpt    2  20822947.578  ops/s
>> [java] CompactIntegerSequenceBench.bench32k  thrpt    2  12037342.253  ops/s
>> [java] CompactIntegerSequenceBench.bench64k  thrpt    2   6782534.469  ops/s
>>
>> LZ4 high is the one instance where block size mattered a lot. It's a bit
>> suspicious, really, when you look at the ratio of performance to block
>> size being close to 1:1. I couldn't spot a bug in the benchmark though.
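The pastebin above holds the actual benchmark; for readers who just want its shape, a minimal JMH harness along these lines (hypothetical class and method names, assuming the lz4-java library that Cassandra bundles) would be:

    import java.util.concurrent.ThreadLocalRandom;
    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.Throughput)
    public class ChunkSizeBench {

        @Param({"8192", "16384", "32768", "65536"})
        int chunkSize;

        byte[] src, dest;
        LZ4Compressor compressor;

        @Setup
        public void setup() {
            compressor = LZ4Factory.fastestInstance().fastCompressor();
            src = new byte[chunkSize];
            // random bytes are the incompressible worst case; a real run
            // would use representative data (e.g. text)
            ThreadLocalRandom.current().nextBytes(src);
            dest = new byte[compressor.maxCompressedLength(chunkSize)];
        }

        @Benchmark
        public int compressOneChunk() {
            // one chunk per invocation, so in this sketch JMH's ops/s is
            // chunks/s; multiply by chunkSize to compare bytes/s across sizes
            return compressor.compress(src, 0, chunkSize, dest, 0);
        }
    }

Swapping in factory.highCompressor() or a Snappy wrapper in setup() would cover the other two cases measured above.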
>> Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
>>
>> Chunk size 8192,  ratio 0.709473
>> Chunk size 16384, ratio 0.667236
>> Chunk size 32768, ratio 0.634735
>> Chunk size 65536, ratio 0.607208
>>
>> By way of comparison, I also ran deflate with maximum compression:
>>
>> Chunk size 8192,  ratio 0.426434
>> Chunk size 16384, ratio 0.402423
>> Chunk size 32768, ratio 0.381627
>> Chunk size 65536, ratio 0.364865
>>
>> Ariel
>>
>>> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
>>> FWIW, I'm not -0; I just think that, long after the freeze date, a change
>>> like this needs a strong mandate from the community. I think the change
>>> is a good one.
>>>
>>>> On 17 Oct 2018, at 22:09, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It's really not appreciably slower compared to the decompression we
>>>> are going to do, which is going to take several microseconds.
>>>> Decompression is also going to be faster because we are going to do
>>>> less unnecessary decompression, and the decompression itself may be
>>>> faster since it may fit in a higher-level cache better. I ran a
>>>> microbenchmark comparing them.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
>>>>
>>>> Fetching a long from memory:  56 nanoseconds
>>>> Compact integer sequence:     80 nanoseconds
>>>> Summing integer sequence:    165 nanoseconds
>>>>
>>>> Currently we have one +1 from Kurt to change the representation and
>>>> possibly a -0 from Benedict. That's not really enough to make an
>>>> exception to the code freeze. If you want it to happen (or not) you
>>>> need to speak up; otherwise only the default will change.
>>>>
>>>> Regards,
>>>> Ariel
>>>>
>>>>> On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
>>>>> I think if we're going to drop it to 16k, we should invest in the
>>>>> compact sequencing as well. Just lowering it to 16k will potentially
>>>>> have a painful impact on anyone running low-memory nodes, but if we
>>>>> can do it without the memory impact I don't think there's any reason
>>>>> to wait another major version to implement it.
>>>>>
>>>>> Having said that, we should probably benchmark the two representations
>>>>> Ariel has come up with.
>>>>>
>>>>> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> I would guess a lot of C* clusters/tables have this option set to the
>>>>>> default value, and not many of them need to read such big chunks of
>>>>>> data. I believe this will greatly limit disk overreads for a fair
>>>>>> amount (a big majority?) of new users. It seems fair enough to change
>>>>>> this default value, and I also think 4.0 is a nice place to do it.
>>>>>>
>>>>>> Thanks for taking care of this, Ariel, and for making sure there is
>>>>>> consensus here as well.
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - al...@thelastpickle.com
>>>>>> France / Spain
>>>>>>
>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
>>>>>>
>>>>>> Le sam. 13 oct. 2018 à 08:52, Ariel Weisberg <ar...@weisberg.ws> a écrit :
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This would only impact new tables; existing tables would get their
>>>>>>> chunk_length_in_kb from the existing schema. It's something we
>>>>>>> record in a system table.
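The "Chunk size N, ratio R" figures Ariel quotes above can be reproduced with a trivial harness that compresses a file one chunk at a time and sums the compressed sizes. A sketch (hypothetical harness, shown here with java.util.zip's Deflater for the "deflate with maximum compression" case; the LZ4-fast numbers would substitute lz4-java's fast compressor):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.Deflater;

    public class ChunkedRatio {
        public static void main(String[] args) throws IOException {
            byte[] data = Files.readAllBytes(Paths.get(args[0])); // e.g. alice.txt
            byte[] out = new byte[128 * 1024];                    // scratch buffer
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            for (int chunk : new int[] { 8192, 16384, 32768, 65536 }) {
                long compressed = 0;
                for (int off = 0; off < data.length; off += chunk) {
                    // each chunk is compressed independently, as Cassandra does
                    deflater.reset();
                    deflater.setInput(data, off, Math.min(chunk, data.length - off));
                    deflater.finish();
                    while (!deflater.finished())
                        compressed += deflater.deflate(out);
                }
                System.out.printf("Chunk size %d, ratio %f%n",
                                  chunk, compressed / (double) data.length);
            }
            deflater.end();
        }
    }

Smaller chunks compress worse because each chunk is compressed independently, so the compressor's match window never spans a chunk boundary; that is exactly the ratio-versus-memory trade-off being debated in this thread.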
>>>>>>> I have an implementation of a compact integer sequence that only
>>>>>>> requires 37% of the memory required today. So we could do this with
>>>>>>> only slightly more than double the memory used today. I'll post that
>>>>>>> to the JIRA soon.
>>>>>>>
>>>>>>> Ariel
>>>>>>>
>>>>>>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
>>>>>>>>
>>>>>>>> I think 16k is a better default, but it should only affect new
>>>>>>>> tables. Whoever changes it, please make sure you think about the
>>>>>>>> upgrade path.
>>>>>>>>
>>>>>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <b...@instaclustr.com> wrote:
>>>>>>>>>
>>>>>>>>> This is something that's bugged me for ages; tbh the performance
>>>>>>>>> gain for most use cases far outweighs the increase in memory usage,
>>>>>>>>> and I would even be in favor of changing the default now and
>>>>>>>>> optimizing the storage cost later (if it's found to be worth it).
>>>>>>>>>
>>>>>>>>> For some anecdotal evidence:
>>>>>>>>> 4kb is usually what we end up setting it to; 16kb feels more
>>>>>>>>> reasonable given the memory impact, but what would be the point if,
>>>>>>>>> practically, most folks set it to 4kb anyway?
>>>>>>>>>
>>>>>>>>> Note that chunk_length will largely be dependent on your read
>>>>>>>>> sizes, but 4k is the floor for most physical devices in terms of
>>>>>>>>> their block size.
>>>>>>>>>
>>>>>>>>> +1 for making this change in 4.0 given the small size and the large
>>>>>>>>> improvement to new users' experience (as long as we are explicit in
>>>>>>>>> the documentation about memory consumption).
>>>>>>>>>
>>>>>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
>>>>>>>>>>
>>>>>>>>>> This ticket has languished for a while. IMO it's too late in 4.0
>>>>>>>>>> to implement a more memory-efficient representation for compressed
>>>>>>>>>> chunk offsets. However, I don't think we should put out another
>>>>>>>>>> release with the current 64k default, as it's pretty unreasonable.
>>>>>>>>>>
>>>>>>>>>> I propose that we lower the value to 16kb. 4k might never be the
>>>>>>>>>> correct default anyway, as there is a cost to compression, and 16k
>>>>>>>>>> will still be a large improvement.
>>>>>>>>>>
>>>>>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0.
>>>>>>>>>> In the past there has been some consensus about reducing this
>>>>>>>>>> value, although maybe with more memory efficiency.
>>>>>>>>>>
>>>>>>>>>> The napkin math for what this costs is:
>>>>>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
>>>>>>>>>> chunks at 8 bytes each (128MB).
>>>>>>>>>> With 16k chunks, that's 512MB.
>>>>>>>>>> With 4k chunks, it's 2GB.
>>>>>>>>>> Per terabyte of data (pre-compression)."
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
>>>>>>>>>>
>>>>>>>>>> By way of comparison, memory-mapping the files has a similar cost:
>>>>>>>>>> 8 bytes per 4k page. Multiple mappings make this more expensive.
>>>>>>>>>> With a default of 16kb this would be 4x less expensive than
>>>>>>>>>> memory-mapping a file. I only mention this to give a sense of the
>>>>>>>>>> costs we are already paying. I am not saying they are directly
>>>>>>>>>> related.
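The napkin math quoted above is easy to sanity-check. A throwaway sketch (hypothetical class name) that reproduces the 128MB/512MB/2GB figures, assuming one 8-byte long per chunk offset as in the current representation:

    public class OffsetsCost {
        public static void main(String[] args) {
            long dataBytes = 1L << 40; // 1TB of uncompressed data
            for (int chunkKb : new int[] { 4, 16, 64 }) {
                long chunks = dataBytes / (chunkKb * 1024L);
                long offsetBytes = chunks * 8L; // one 8-byte offset per chunk
                System.out.printf("%dk chunks: %,d offsets -> %,d MB%n",
                                  chunkKb, chunks, offsetBytes >> 20);
            }
        }
    }

Running it prints 16,777,216 offsets (128 MB) for 64k chunks, 512 MB for 16k, and 2,048 MB for 4k, matching the quote.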
>>>>>>>>>> I'll wait a week for discussion and, if there is consensus, make
>>>>>>>>>> the change.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Ariel
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ben Bromhead
>>>>>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
>>>>>>>>> +1 650 284 9692
>>>>>>>>> Reliability at Scale
>>>>>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
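The compact integer sequence Ariel mentions isn't shown in the thread (he says he'll post it to the JIRA). Purely to illustrate the kind of encoding available for a monotonically increasing offset sequence, here is a toy sketch; it is not his implementation and falls well short of his 37% figure, keeping every 16th offset as an absolute 8-byte long and the rest as 4-byte deltas from the preceding sample, for roughly 4.5 bytes per offset instead of 8:

    final class CompactOffsets {
        private static final int SAMPLE = 16;
        private final long[] samples; // absolute offset every SAMPLE chunks
        private final int[] deltas;   // offset minus preceding sample

        CompactOffsets(long[] offsets) {
            samples = new long[(offsets.length + SAMPLE - 1) / SAMPLE];
            deltas = new int[offsets.length];
            for (int i = 0; i < offsets.length; i++) {
                if (i % SAMPLE == 0)
                    samples[i / SAMPLE] = offsets[i];
                // fits in an int as long as 15 compressed chunks stay under 2GB
                deltas[i] = (int) (offsets[i] - samples[i / SAMPLE]);
            }
        }

        long offset(int chunk) {
            return samples[chunk / SAMPLE] + deltas[chunk];
        }
    }

Lookup stays O(1) (one array read per level), which matters because the 56ns/80ns/165ns comparison earlier in the thread is exactly about the per-read cost of a fancier representation.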