The default when I wrote it was 0.4, but it was found that this did not saturate flush writers in JBOD configurations. IIRC it now defaults to 1/(1 + #disks), which is not a terrible default, but obviously comes out much lower if you have many disks.
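For illustration, here is a quick sketch (Python, not the actual Cassandra code) of what that default means in practice. It assumes the default really is 1/(1 + #disks) as described above and that the threshold is the fraction of the configured memtable space at which the largest memtable gets flushed; the space figure is a made-up example, so check your own cassandra.yaml.

# Rough sketch, not Cassandra source. Assumes memtable_cleanup_threshold
# defaults to 1 / (1 + number of data disks) and that a flush of the largest
# memtable is triggered once total memtable usage crosses
# threshold * configured memtable space. Verify against your version's
# cassandra.yaml comments before relying on these numbers.

def default_cleanup_threshold(num_data_disks: int) -> float:
    return 1.0 / (1 + num_data_disks)

def flush_trigger_mb(memtable_space_mb: float, threshold: float) -> float:
    return memtable_space_mb * threshold

if __name__ == "__main__":
    space_mb = 2048  # hypothetical memtable_heap_space_in_mb
    for disks in (1, 3, 8):
        t = default_cleanup_threshold(disks)
        print(f"{disks} data disk(s): threshold {t:.2f} "
              f"-> flush kicks in around {flush_trigger_mb(space_mb, t):.0f} MB")

With 8 disks the threshold drops to roughly 0.11, which is why JBOD setups end up flushing much smaller memtables than the old fixed 0.4.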
This smaller default behaves better for peak performance, but in a live system where compaction is king, not saturating flush in return for lower write amplification (from flushing larger memtables) will indeed often be a win. 0.6, however, is probably not the best default unless you have a lot of tables being actively written to, in which case even 0.8 would be fine. With a single main table receiving your writes at a given time, 0.4 is probably an optimal value when making this trade-off against peak performance.

Anyway, it's probably better to file a ticket to discuss defaults and documentation than to make a statement like this without justification. I can see where you're coming from, but it's confusing for users to have such blanket guidance that counters the defaults. If the defaults can be improved (which I agree they can), it's probably better to do that, along with better documentation, so the nuance is accounted for.

On Friday, 26 August 2016, Ryan Svihla <r...@foundev.pro> wrote:

> Forgot the most important thing: logs.
>
> ERROR: you should investigate.
> WARN: you should have a list of known ones. Use case dependent. Ideally you change configuration accordingly.
> * PoolCleaner (slab or native) - a good indication the node is tuned badly if you see a ton of these. Set memtable_cleanup_threshold to 0.6 as an initial attempt to configure this correctly. This is a complex topic to dive into, so that may not be the best number, but it'll likely be better than the default; why it's not the default is a big conversation.
>
> There are a bunch of other logs I look for that are escaping me at present, but that's a good start.
>
> -regards,
>
> Ryan Svihla
>
> On Fri, Aug 26, 2016 at 7:21 AM -0500, "Ryan Svihla" <r...@foundev.pro> wrote:
>
>> Thomas,
>>
>> Not all metrics are KPIs, and many are only useful when researching a specific issue or after a use-case-specific threshold has been set.
>>
>> The main "canaries" I monitor are:
>> * Pending compactions (dependent on the compaction strategy chosen, but 1000 is a sign of severe issues in all cases)
>> * Dropped mutations (more than one I treat as an event to investigate; I believe in allowing operational overhead, and any evidence of load shedding suggests I may not have as much as I thought)
>> * Blocked anything (flush writers, etc.; more than one I investigate)
>> * System hints (more than 1k I investigate)
>> * Heap usage and GC time vary a lot by use case and collector chosen; I aim for below 65% usage as an average with G1, but this again varies by use case a great deal. Sometimes I just look at the chart and the query patterns, and if they don't line up I have to do other, deeper investigations.
>> * Read and write latencies exceeding SLA are also use case dependent. For those that have none, I tend to push towards p99, with a middle-end SSD-based system at 100ms and a spindle-based system at 600ms, at CL ONE and assuming a "typical" query pattern (again, query patterns and CL vary a lot here).
>> * Cell count and partition size vary greatly by hardware and GC tuning, but in the absence of all other relevant information I like to keep cell count for a partition below 100k and size below 100mb. I do, however, have many successful use cases running more, and I've had some fail well before that. Hardware and tuning trade-offs shift this around a lot.
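As a rough illustration of turning those levels into automated checks, here is a sketch that shells out to nodetool. The output layout differs between Cassandra versions, so the regexes are only indicative, and the thresholds are simply the ones quoted above; re-base them on your own workload as discussed further down.

#!/usr/bin/env python3
# Sketch only: thresholds are the rough levels quoted in this thread, and the
# nodetool output parsing is indicative (column layouts vary by version).
import re
import subprocess

def nodetool(*args):
    return subprocess.run(["nodetool", *args],
                          capture_output=True, text=True, check=True).stdout

def pending_compactions():
    m = re.search(r"pending tasks:\s*(\d+)", nodetool("compactionstats"))
    return int(m.group(1)) if m else 0

def dropped_mutations(tpstats_out):
    # Most versions end `nodetool tpstats` with a "Message type / Dropped" table.
    m = re.search(r"^MUTATION\s+(\d+)", tpstats_out, re.MULTILINE)
    return int(m.group(1)) if m else 0

def blocked_flush_writers(tpstats_out):
    # Thread-pool table columns: Active, Pending, Completed, Blocked, All time blocked.
    m = re.search(r"^MemtableFlushWriter\s+\d+\s+\d+\s+\d+\s+(\d+)", tpstats_out,
                  re.MULTILINE)
    return int(m.group(1)) if m else 0

if __name__ == "__main__":
    tp = nodetool("tpstats")
    checks = [
        ("pending compactions", pending_compactions(), 1000),  # "severe in all cases"
        ("dropped mutations", dropped_mutations(tp), 2),       # "more than one"
        ("blocked flush writers", blocked_flush_writers(tp), 2),
    ]
    for name, value, limit in checks:
        flag = "INVESTIGATE" if value >= limit else "ok"
        print(f"{name}: {value} [{flag}]")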
>> There is, unfortunately, as you'll note, a lot of nuance, and the load-out really changes what looks right (down to the model of SSD: I have different expectations for p99s, and if it's a model I haven't used before I'll do some comparative testing).
>>
>> The reason so much of this is general and vague is my selection bias. I'm brought in when people are complaining about performance or some grand systemic crash because they were monitoring nothing. I have little ability to change hardware initially, so I have to be willing to let the hardware do the best it can and establish levels where it can no longer keep up with the customer's goals. This may mean that for some use cases 10 pending compactions is an actionable event, while for another customer 100 is. The better approach is to establish a baseline for when these metrics start to indicate a serious issue is occurring in that particular app. Basically, when people notice a problem, what did these numbers look like in the minutes, hours and days prior? That's the way to establish the levels consistently.
>>
>> Regards,
>>
>> Ryan Svihla
>>
>> On Fri, Aug 26, 2016 at 4:48 AM -0500, "Thomas Julian" <thomasjul...@zoho.com> wrote:
>>
>>> Hello,
>>>
>>> I am working on setting up a monitoring tool to monitor Cassandra instances. Are there any wikis which specify optimum values for each Cassandra KPI?
>>> For instance, I am not sure:
>>>
>>> 1. What value of "Memtable Columns Count" can be considered "Normal".
>>> 2. What value of the same has to be considered "Critical".
>>>
>>> I know threshold numbers for a few params; for instance, anything more than zero for timeouts or pending tasks should be considered unusual. Also, I am aware that most of the statistics' threshold numbers vary in accordance with hardware specification and Cassandra environment setup. But what I request here is a general guideline for configuring thresholds for all the metrics.
>>>
>>> If this has been already covered, please point me to that resource. If anyone has collected these things out of their own interest, please share.
>>>
>>> Any help is appreciated.
>>>
>>> Best Regards,
>>> Julian.
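Following up on the baselining point Ryan makes above (derive alert levels from what a metric looked like while the app was healthy rather than from fixed numbers), a minimal sketch of that idea might look like the following; the percentile and headroom factor are arbitrary illustrations, not recommendations.

# Minimal sketch of deriving "investigate" levels from known-good history
# instead of fixed numbers. Percentile and headroom are illustrative only.
from statistics import quantiles

def baseline_threshold(known_good_samples, headroom=2.0):
    """Alert level = headroom x the p99 of the metric during healthy periods."""
    p99 = quantiles(known_good_samples, n=100)[98]
    return headroom * p99

# Hypothetical: pending-compaction samples collected while the application
# met its SLAs; in practice feed it days or weeks of data.
healthy_history = [0, 1, 2, 3, 2, 1, 0, 4, 5, 3, 2, 8, 1, 0, 2]
print(baseline_threshold(healthy_history))

Whether p99 times 2 (or any other rule of thumb) is right still depends on the workload; the point is just to anchor the threshold in what "normal" looked like before incidents rather than in a universal constant.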