Chris, thank you for the MBean reference.
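For anyone digging this thread out of the archives later: the jmxterm session for the
throughput attribute Alain describes further down ends up looking roughly like the sketch
below. The default JMX port and no authentication are assumed, the jar name depends on the
jmxterm version you download, and the object name is what I would expect on a stock
install; if it does not resolve, list the candidates with the 'beans' command.

    # rough jmxterm session sketch (assumes JMX on localhost:7199, no auth;
    # jar name varies by jmxterm version)
    java -jar jmxterm-1.0-alpha-4-uber.jar
    $> open localhost:7199
    $> bean org.apache.cassandra.db:type=StorageService
    $> get CompactionThroughputMbPerSec
    $> set CompactionThroughputMbPerSec 32
    $> get CompactionThroughputMbPerSec

As noted downthread, this does not survive a restart; cassandra.yaml still needs to be
updated for a permanent change.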
On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com> wrote:

> Alain, thank you for your email. I really, really appreciate it!
>
> I am actually trying to remove disk IO from the suspect list, thus I want
> to reduce the number of concurrent compactors. I'll give throughput a shot.
> No, I don't have a long list of pending compactions, however my instances
> are still on magnetic drives and can't really afford a high number of
> compactors.
>
> We started to have slowdowns and most likely we were undersized; new
> features are coming in and I want to be ready for them.
>
> *About the issue:*
>
> - High system load on Cassandra nodes. This means top saying 6.0/12.0 on
>   a 4 vcpu instance (!)
> - CPU is high:
>   - Dynatrace says 50%
>   - top easily goes to 80%
> - Network around 30Mb (according to Dynatrace)
> - Disks:
>   - ~40 iops
>   - high latency: ~20ms (min 8, max 50!)
>   - negligible iowait
>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
> - Client timeouts:
>   - mostly when reading
>   - few cases when writing
> - Slowly growing "All time blocked" count for Native-Transport-Requests:
>   - small numbers: hundreds vs millions of successfully served requests
>
> The system:
>
> - Cassandra 3.0.6
> - most tables on LCS
>   - frequent r/w pattern
> - few tables with DTCS
>   - need to upgrade to 3.0.8 for TWCS
>   - mostly TS data, stream write / batch read
> - all our keyspaces have RF: 3
> - all nodes in the same AZ
> - m1.xlarge:
>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>   - 4 vcpu
>   - 15GB ram
> - workload:
>   - Java applications:
>     - mostly feeding Cassandra, writing incoming data
>   - Apache Spark applications:
>     - batch processes that read from and write back to C* or other systems
>     - not co-located
>
> So far my effort was put into growing the ring to better distribute the
> load and decrease the pressure, including:
>
> - increasing the node count from 3 to 5 (6th node joining)
> - JVM memory "optimization":
>   - heaps were set by the default script to something a bit smaller than
>     4GB, with CMS GC
>   - GC pressure was high / long GC pauses
>   - clients were suffering from read timeouts
>   - increased the heap, still using CMS:
>     - very long GC pauses
>     - not much tuning around CMS
>   - switched to G1 and forced a 6/7GB heap on each node using (almost) the
>     suggested settings:
>     - much more stable
>     - generally < 300ms
>     - I still have long pauses from time to time (mostly around 1200ms,
>       sometimes 3000ms on some nodes)
>
> *Thinking out loud:*
> Things are much better, however I still see high CPU usage, especially
> when Spark kicks in, even though the Spark jobs are very small in terms of
> resources (a single worker with very limited parallelism).
>
> On LCS tables cfstats reports single-digit read latencies and generally
> 0.x ms write latencies (as of today).
> On DTCS tables I have 0.x ms write latency but still double-digit read
> latency; I guess I should spend some time tuning that, or upgrade and move
> away from DTCS :(
> Yes, Spark reads mostly from DTCS tables.
>
> It is still fairly common to see dropped READ, HINT and MUTATION messages:
>
> - not on all nodes
> - this generally happens on node restart
>
> On a side note, I tried to install libjemalloc1 from the Ubuntu repo
> (mixed 14.04 and 16.04) with terrible results: much slower instance
> startup and responsiveness. How could that be?
>
> Once everything is stabilized I'll prepare our move to VPC and possibly
> upgrade to i3 instances. Any comment on the hardware side? Are 4 cores
> still reasonable hardware?
>
> Best,
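(For reference, an fio baseline like the one mentioned above usually looks something
along the following lines; the flags, sizes and directory are illustrative, not the exact
command that was run.)

    # 4k random-read baseline on an idle instance (illustrative flags/paths)
    fio --name=randread-baseline --directory=/var/lib/cassandra/data \
        --ioengine=libaio --direct=1 --rw=randread --bs=4k --size=2G \
        --numjobs=4 --iodepth=16 --runtime=60 --time_based --group_reporting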
> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hello Riccardo,
>>
>> I noticed I have been writing a novel to answer a simple couple of
>> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what
>> you were looking for :). Also, there is a warning that it might be
>> counter-productive and stress the cluster even more to increase the
>> compaction throughput. There is more information below ('about the
>> issue').
>>
>> *tl;dr*:
>>
>> What about using 'nodetool setcompactionthroughput XX' instead? It
>> should be available there.
>>
>> In the same way, 'nodetool getcompactionthroughput' gives you the
>> current value. Be aware that this change done through JMX/nodetool is
>> *not* permanent. You still need to update the cassandra.yaml file.
>>
>> If you really want to use the MBean through JMX, because using
>> 'nodetool' is too easy (or for any other reason :p):
>>
>> MBean: org.apache.cassandra.service.StorageServiceMBean
>> Attribute: CompactionThroughputMbPerSec
>>
>> *Long story* with the "how to", since I went through this search myself
>> and did not know where this MBean was.
>>
>>> Can someone point me to the right mbean?
>>> I can not really find good docs about mbeans (or tools ...)
>>
>> I am not sure about the doc, but you can use jmxterm
>> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>
>> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole
>> to find the MBeans locally:
>>
>> * Add loopback addresses for ccm (see the readme file)
>> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
>> * Start jconsole using the right pid:
>>   'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
>> * Explore MBeans, try to guess where this could be (and discover other
>>   funny stuff in there :)).
>>
>> I must admit I did not find it this way using C* 3.0.6 and jconsole.
>> I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI
>> CompactionThroughput' with this result:
>> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>
>> With this I could find the right MBean. The only code documentation that
>> is always up to date is the code itself, I am afraid:
>>
>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:
>> public void setCompactionThroughputMbPerSec(int value);'
>>
>> Note that the research in the code also leads to nodetool ;-).
>>
>> I could finally find the MBean in jconsole too:
>> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will
>> live).
>>
>> jconsole also lets you see which attributes can be set and which cannot.
>>
>> You can now find any other MBean you would need, I hope :).
>>
>>> see if it helps when the system is under stress
>>
>> *About the issue*
>>
>> You don't exactly say what you are observing; what is that "stress"? How
>> is it impacting the cluster?
>>
>> I ask because I am afraid this change might not help and might even be
>> counter-productive. Even though having SSTables nicely compacted makes a
>> huge difference at read time, if that's already the case for you and the
>> data is already nicely compacted, this change won't help. It might even
>> make things slightly worse if the current bottleneck is disk IO during a
>> stress period, as the compactors would increase their disk read
>> throughput and thus maybe fight with the read requests for disk
>> throughput.
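(In practice the nodetool route above boils down to something like the following; 32 is
just an example value, and 16 MB/s is the cassandra.yaml default if I remember correctly.)

    # is compaction actually backed up on this node?
    nodetool compactionstats -H
    # read the current throttle, raise it temporarily, then put it back
    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 32
    nodetool setcompactionthroughput 16   # the change is lost on restart anyway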
>> If you have a similar number of SSTables on all nodes, not many
>> compactions pending ('nodetool compactionstats -H') and read operations
>> hitting a small number of SSTables ('nodetool tablehistograms'), then you
>> probably don't need to increase the compaction speed.
>>
>> Let's say that the compaction throughput is not often the cause of
>> stress during peak hours, nor a direct way to make things 'faster'.
>> Generally, when compaction goes wrong, the number of SSTables goes
>> *through* the roof. If you have a chart showing the number of SSTables,
>> you can see this really well.
>>
>> Of course, if you feel you are in this case, increasing the compaction
>> throughput will definitely help, provided the cluster also has spare
>> disk throughput.
>>
>> To check what's wrong, if you believe it's something different, here are
>> some useful commands:
>>
>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>> - check WARNs and ERRORs in the logs (i.e. grep -e "WARN" -e "ERROR"
>>   /var/log/cassandra/system.log)
>> - check local latencies (nodetool tablestats / nodetool tablehistograms)
>>   and compare them to the client request latency. At the node level,
>>   reads should probably be single-digit milliseconds, rather close to
>>   1 ms with SSDs, and writes below a millisecond most probably (it
>>   depends on the data size too, etc.)
>> - trace a query during this period and see what takes time (for example
>>   from 'cqlsh': 'TRACING ON; SELECT ...')
>>
>> You can also analyze the *garbage collection* activity. As Cassandra
>> uses the JVM, a badly tuned GC will induce long pauses. Depending on the
>> workload, and I must say for most of the clusters I work on, the default
>> tuning is not that good and can keep the server busy 10-15% of the time
>> with stop-the-world GC.
>> You might find this post by my colleague Jon about GC tuning for Apache
>> Cassandra interesting:
>> http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. GC pressure is
>> a very common way to optimize a Cassandra cluster, to adapt it to your
>> workload/hardware.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>
>>> Hi list,
>>>
>>> Cassandra 3.0.6
>>>
>>> I'd like to test changing the number of concurrent compactors to see if
>>> it helps when the system is under stress.
>>>
>>> Can someone point me to the right mbean?
>>> I can not really find good docs about mbeans (or tools ...)
>>>
>>> Any suggestion much appreciated, best
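(For completeness, the checks Alain lists above boil down to a quick sweep like the one
below; the keyspace/table names are placeholders and the log path assumes a standard
package install.)

    nodetool tpstats                                  # pending/blocked/dropped
    grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log | tail -n 100
    grep -i "GCInspector" /var/log/cassandra/system.log | tail -n 20
    nodetool tablestats my_keyspace.my_table          # local read/write latencies
    nodetool tablehistograms my_keyspace my_table     # sstables touched per read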