The default when I wrote it was 0.4, but it was found that this did not
saturate flush writers in JBOD configurations. IIRC it now defaults to
1/(1 + #disks), which is not a terrible default, but it obviously comes out
much lower if you have many disks (with eight data directories, for example,
it works out to roughly 0.11).

This smaller value behaves better for peak performance, but in a live
system where compaction is king, not saturating flush in return for lower
write amplification (from flushing larger memtables) will indeed often be a
win.

0.6, however, is probably not the best default unless you have a lot of
tables being actively written to, in which case even 0.8 would be fine.
With a single main table receiving your writes at a given time, 0.4 is
probably the optimal value when making this trade-off against peak
performance.
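
For concreteness, the knob in question is memtable_cleanup_threshold in
cassandra.yaml; it ships commented out, so the implicit
1/(memtable_flush_writers + 1) value applies unless you set it. A minimal
excerpt using the 0.4 starting point discussed above (the right value
depends on how many tables take writes concurrently):

    # cassandra.yaml
    # Ratio of occupied memtable space that triggers flushing the largest
    # memtable. Left unset, it defaults to 1 / (memtable_flush_writers + 1).
    memtable_cleanup_threshold: 0.4
    # memtable_flush_writers defaults roughly to one per data directory
    # (version dependent); raising it lowers the implicit cleanup threshold.
    # memtable_flush_writers: 2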

Anyway, it's probably better to file a ticket to discuss defaults and
documentation than to make a statement like this without justification. I
can see where you're coming from, but it's confusing for users to be given
such blanket guidance that counters the defaults. If the defaults can be
improved (and I agree they can), it's probably better to do that, along
with better documentation, so the nuance is accounted for.


On Friday, 26 August 2016, Ryan Svihla <r...@foundev.pro> wrote:

>
> Forgot the most important thing: logs.
> ERROR: you should investigate.
> WARN: you should have a list of known ones; this is use case dependent.
> Ideally you change configuration accordingly.
> *PoolCleaner (slab or native): a good indication the node is tuned badly if
> you see a ton of this. Set memtable_cleanup_threshold to 0.6 as an initial
> attempt to configure this correctly. This is a complex topic to dive into,
> so that may not be the best number, but it'll likely be better than the
> default; why it's not the default is a big conversation.
> There are a bunch of other logs I look for that are escaping me at present,
> but that's a good start.
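
As a rough illustration of that kind of log triage, here is a minimal Python
sketch; the log path and the assumption that the level appears at the start
of each line reflect a stock install with the default logback layout, so
both are assumptions to adjust for your environment:

    #!/usr/bin/env python3
    """Count ERROR / WARN / PoolCleaner lines in the Cassandra system log."""
    from collections import Counter

    LOG = "/var/log/cassandra/system.log"  # assumption: stock package location

    counts = Counter()
    with open(LOG, errors="replace") as fh:
        for line in fh:
            if line.startswith("ERROR"):
                counts["ERROR"] += 1
            elif line.startswith("WARN"):
                counts["WARN"] += 1
            if "PoolCleaner" in line:  # slab or native pool cleaner activity
                counts["PoolCleaner"] += 1

    for key in ("ERROR", "WARN", "PoolCleaner"):
        print(key, counts[key])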
>
> -regards,
>
> Ryan Svihla
>
>
>
>
> On Fri, Aug 26, 2016 at 7:21 AM -0500, "Ryan Svihla" <r...@foundev.pro> wrote:
>
> Thomas,
>>
>> Not all metrics are KPIs; many are only useful when researching a
>> specific issue or after a use-case-specific threshold has been set.
>>
>> The main "canaries" I monitor are:
>> * Pending compactions (dependent on the compaction strategy chosen, but
>> 1000 is a sign of severe issues in all cases)
>> * Dropped mutations (more than one I treat as an event to investigate; I
>> believe in allowing operational headroom, and any evidence of load
>> shedding suggests I may not have as much as I thought)
>> * Blocked anything (flush writers, etc.; more than one and I investigate)
>> * System hints (more than 1k and I investigate)
>> * Heap usage and GC time vary a lot by use case and collector chosen. I
>> aim for below 65% usage as an average with G1, but this again varies a
>> great deal by use case. Sometimes I just look at the chart and the query
>> patterns, and if they don't line up I have to do other, deeper
>> investigations.
>> * Read and write latencies exceeding the SLA are also use case dependent.
>> For those that have no SLA, I tend to push towards a p99 of 100ms on a
>> mid-range SSD-based system and 600ms on a spindle-based system, at CL ONE
>> and assuming a "typical" query pattern (again, query patterns and CL vary
>> a lot here).
>> * Cell count and partition size vary greatly by hardware and GC tuning,
>> but in the absence of all other relevant information I like to keep the
>> cell count for a partition below 100k and the size below 100mb. I do,
>> however, have many successful use cases running more, and I've had some
>> fail well before that. Hardware and tuning tradeoffs shift this around a
>> lot.
>> There is unfortunately, as you'll note, a lot of nuance, and the load out
>> really changes what looks right (down to the model of SSD: I have
>> different expectations for p99s, and if it's a model I haven't used
>> before I'll do some comparative testing).
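
To make those canaries concrete, here is a minimal Python sketch that drives
a few of them off nodetool output, using the rough thresholds above (pending
compactions near 1000, any dropped mutations, any blocked flush writers).
nodetool's output format varies between Cassandra versions, so the parsing
here is an approximation to validate against your own version:

    #!/usr/bin/env python3
    """Check a few Cassandra 'canary' metrics by parsing nodetool output."""
    import re
    import subprocess

    def nodetool(*args):
        return subprocess.check_output(("nodetool",) + args, text=True)

    alerts = []

    # Pending compactions: `nodetool compactionstats` prints "pending tasks: N".
    m = re.search(r"pending tasks:\s*(\d+)", nodetool("compactionstats"))
    if m and int(m.group(1)) >= 1000:
        alerts.append("pending compactions: " + m.group(1))

    # Thread pools and dropped messages: `nodetool tpstats`.
    for line in nodetool("tpstats").splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue
        # Pool rows: Pool Name, Active, Pending, Completed, Blocked, All time blocked.
        if cols[0].endswith("FlushWriter") and cols[-2].isdigit() and int(cols[-2]) > 0:
            alerts.append("blocked flush writers: " + cols[-2])
        # Dropped-message rows: message type followed by a count.
        if cols[0] == "MUTATION" and cols[-1].isdigit() and int(cols[-1]) > 0:
            alerts.append("dropped mutations: " + cols[-1])

    print("\n".join(alerts) if alerts else "no canaries tripped")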
>>
>> The reason so much of this is general and vague is my selection bias: I'm
>> brought in when people are complaining about performance, or after some
>> grand systemic crash because they were monitoring nothing. I have little
>> ability to change hardware initially, so I have to be willing to let the
>> hardware do the best it can and establish the levels at which it can no
>> longer keep up with the customer's goals. This may mean that for one use
>> case 10 pending compactions is an actionable event, while for another
>> customer 100 is. The better approach is to establish a baseline for when
>> these metrics start to indicate a serious issue in that particular app.
>> Basically, when people notice a problem, what did these numbers look like
>> in the minutes, hours and days prior? That's the way to establish the
>> levels consistently.
>>
>> Regards,
>>
>> Ryan Svihla
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Aug 26, 2016 at 4:48 AM -0500, "Thomas Julian" <
>> thomasjul...@zoho.com> wrote:
>>
>> Hello,
>>>
>>> I am working on setting up a monitoring tool to monitor Cassandra
>>> instances. Are there any wikis which specify optimum values for each
>>> Cassandra KPI?
>>> For instance, I am not sure,
>>>
>>>    1. What value of "Memtable Columns Count" can be considered
>>>    "normal".
>>>    2. What value of the same has to be considered "critical".
>>>
>>> I know threshold numbers for a few params; for instance, anything more
>>> than zero for timeouts or pending tasks should be considered unusual.
>>> Also, I am aware that most of the statistics' threshold numbers vary
>>> with the hardware specification and the Cassandra environment setup.
>>> But what I request here is a general guideline for configuring
>>> thresholds for all the metrics.
>>>
>>> If this has already been covered, please point me to that resource. If
>>> anyone has collected these things out of their own interest, please share.
>>>
>>> Any help is appreciated.
>>>
>>> Best Regards,
>>> Julian.
>>>
>>>
>>>
