On Thu, Oct 11, 2018 at 4:31 PM Ben Bromhead <b...@instaclustr.com> wrote:
> This is something that's bugged me for ages, tbh the performance gain for
> most use cases far outweighs the increase in memory usage and I would even
> be in favor of changing the default now, optimizing the storage cost later
> (if it's found to be worth it).
>
> For some anecdotal evidence:
> 4kb is usually what we end up setting it to, 16kb feels more reasonable given
> the memory impact, but what would be the point if practically, most folks
> set it to 4kb anyway?
>
> Note that chunk_length will largely be dependent on your read sizes, but 4k
> is the floor for most physical devices in terms of ones block size.
> It might be worthwhile to investigate how splitting chunk size into data,
> index and compaction sizes would affect performance.
>
> +1 for making this change in 4.0 given the small size and the large
> improvement to new users' experience (as long as we are explicit in the
> documentation about memory consumption).
>
>
> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> > Hi,
> >
> > This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> >
> > This ticket has languished for a while. IMO it's too late in 4.0 to
> > implement a more memory efficient representation for compressed chunk
> > offsets. However I don't think we should put out another release with the
> > current 64k default as it's pretty unreasonable.
> >
> > I propose that we lower the value to 16kb. 4k might never be the correct
> > default anyways as there is a cost to compression and 16k will still be a
> > large improvement.
> >
> > Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
> > past there has been some consensus about reducing this value although maybe
> > with more memory efficiency.
> >
> > The napkin math for what this costs is:
> > "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> > at 8 bytes each (128MB).
> > With 16k chunks, that's 512MB.
> > With 4k chunks, it's 2G.
> > Per terabyte of data (pre-compression)."
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> >
> > By way of comparison memory mapping the files has a similar cost per 4k
> > page of 8 bytes. Multiple mappings make this more expensive. With a
> > default of 16kb this would be 4x less expensive than memory mapping a file.
> > I only mention this to give a sense of the costs we are already paying. I
> > am not saying they are directly related.
> >
> > I'll wait a week for discussion and if there is consensus make the change.
> >
> > Regards,
> > Ariel
> >
> --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
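
For anyone who wants to check the napkin math quoted above, here is a minimal sketch in Python. It assumes binary units (1TB taken as 2**40 bytes, 64k as 64 * 1024 bytes) and one 8-byte offset per chunk, which is what the quoted JIRA comment states. The function and constant names are made up for illustration and are not Cassandra code.

```python
# Napkin math for compressed chunk offset memory, following the quoted figures.
# Assumptions (not from Cassandra itself): binary units and exactly one
# 8-byte offset held in memory per chunk of uncompressed data.

OFFSET_BYTES = 8   # bytes of offset metadata per chunk, as quoted in the thread
TIB = 2 ** 40      # "1TB" of uncompressed data, taken here as 2**40 bytes
MIB = 2 ** 20


def offset_overhead_bytes(data_bytes: int, chunk_bytes: int) -> int:
    """Memory needed to hold one 8-byte offset per chunk."""
    chunks = -(-data_bytes // chunk_bytes)  # ceiling division
    return chunks * OFFSET_BYTES


def overhead_table(data_bytes: int = TIB) -> dict:
    """Map chunk size in KiB to offset overhead in MiB for data_bytes of data."""
    return {
        kib: offset_overhead_bytes(data_bytes, kib * 1024) // MIB
        for kib in (64, 16, 4)
    }


if __name__ == "__main__":
    for kib, mib in overhead_table().items():
        print(f"{kib:>2}k chunks: {mib} MiB per TiB of uncompressed data")
```

Under those assumptions the 4k row is also the figure for the memory mapping comparison (8 bytes per 4k page), so the 16k row coming out at a quarter of it matches the "4x less expensive" remark.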