Alain, thank you for your email. I really appreciate it! I am actually trying to remove disk I/O from the suspect list, so I want to reduce the number of concurrent compactors. I'll give throughput a shot. No, I don't have a long list of pending compactions, however my instances are still on magnetic drives and can't really afford a high number of compactors.
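
For reference, this is roughly what I plan to try, first at runtime with nodetool and then persisted in cassandra.yaml (the values are just my first guess for these magnetic disks, not a recommendation):

    # runtime override, reversible, lost on restart
    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 8

    # cassandra.yaml, picked up at the next restart
    compaction_throughput_mb_per_sec: 8
    concurrent_compactors: 1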
We started to have slowdowns and most likely we were undersized; new features are coming in and I want to be ready for them.

*About the issue:*
- High system load on Cassandra nodes. This means top saying 6.0/12.0 on a 4 vCPU instance (!)
- CPU is high:
  - Dynatrace says 50%
  - top easily goes to 80%
- Network around 30Mb (according to Dynatrace)
- Disks:
  - ~40 iops
  - high latency: ~20ms (min 8, max 50!)
  - negligible iowait
  - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
- Clients timeout
  - mostly when reading
  - few cases when writing
- Slowly growing number of "All time blocked" for Native-Transport-Requests
  - small numbers: hundreds vs millions of successfully served requests

The system:
- Cassandra 3.0.6
  - most tables on LCS
    - frequent r/w pattern
  - few tables with DTCS
    - need to upgrade to 3.0.8 for TWCS
    - mostly TS data, stream write / batch read
- All our keyspaces have RF: 3
- All nodes in the same AZ
- m1.xlarge
  - 4x420GB drives (ephemeral storage) configured in striping (RAID 0)
  - 4 vCPU
  - 15GB RAM
- Workload:
  - Java applications
    - mostly feeding Cassandra, writing data as it comes in
  - Apache Spark applications
    - batch processes that read from and write back to C* or other systems
    - not co-located

So far my effort went into growing the ring to better distribute the load and decrease the pressure, including:
- Increasing the node count from 3 to 5 (6th node joining)
- JVM memory "optimization":
  - heaps were set by the default script to something a bit smaller than 4GB, with CMS GC
    - GC pressure was high / long GC pauses
    - clients were suffering from read timeouts
  - increased the heap, still using CMS:
    - very long GC pauses
    - not much tuning around CMS
  - switched to G1 and forced a 6-7GB heap on each node using (almost) the suggested settings:
    - much more stable
    - generally < 300ms
    - I still have long pauses from time to time (mostly around 1200ms, sometimes 3000ms on some nodes)

*Thinking out loud:*
Things are much better, however I still see high CPU usage, especially when Spark kicks in, even though the Spark jobs are very small in terms of resources (single worker with very limited parallelism).

On LCS tables, cfstats reports single-digit read latencies and generally 0.x ms write latencies (as of today). On DTCS tables I have 0.x ms write latency but still double-digit read latency; I guess I should spend some time tuning that, or upgrade and move away from DTCS :( Yes, Spark reads mostly from DTCS tables.

It is still kinda common to have dropped READ, HINT and MUTATION messages:
- not on all nodes
- this generally happens on node restart

On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed 14.04 and 16.04) with terrible results: much slower instance startup and responsiveness. How could that be?

Once everything is stabilized I'll prepare our move to VPC and possibly an upgrade to i3 instances. Any comment on the hardware side? Are 4 cores still reasonable hardware?

Best,

On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Riccardo,
>
> I noticed I have been writing a novel to answer a simple couple of questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what you were looking for :). Also, there is a warning that it might be counter-productive and stress the cluster even more to increase the compaction throughput. There is more information below ('about the issue').
>
> *tl;dr*:
>
> What about using 'nodetool setcompactionthroughput XX' instead? It should be available there.
>
> In the same way, 'nodetool getcompactionthroughput' gives you the current value. Be aware that this change done through JMX/nodetool is *not* permanent. You still need to update the cassandra.yaml file.
>
> If you really want to use the MBean through JMX, because using 'nodetool' is too easy (or for any other reason :p):
>
> MBean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
>
> *Long story* with the "how to", since I went through this search myself and did not know where this MBean was.
>
>> Can someone point me to the right mbean?
>> I can not really find good docs about mbeans (or tools ...)
>
> I am not sure about the doc, but you can use jmxterm (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>
> To replace the doc, I use CCM (https://github.com/riptano/ccm) + jconsole to find the MBeans locally:
>
> * Add loopback addresses for ccm (see the readme file)
> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
> * Explore MBeans, try to guess where this could be (and discover other funny stuff in there :)).
>
> I must admit I did not find it this way using C* 3.0.6 and jconsole. I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI CompactionThroughput' with this result: https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>
> With this I could find the right MBean; the only code documentation that is always up to date is the code itself, I am afraid:
>
> './src/java/org/apache/cassandra/service/StorageServiceMBean.java: public void setCompactionThroughputMbPerSec(int value);'
>
> Note that the research in the code also leads to nodetool ;-).
>
> I could finally find the MBean in jconsole too: https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will live).
>
> jconsole also allows you to see which attributes it is possible to set or not.
>
> You can now find any other MBean you would need, I hope :).
>
>> see if it helps when the system is under stress
>
> *About the issue*
>
> You don't exactly say what you are observing; what is that "stress"? How is it impacting the cluster?
>
> I ask because I am afraid this change might not help and might even be counter-productive. Even though having SSTables nicely compacted makes a huge difference at read time, if that's already the case for you and the data is already nicely compacted, doing this change won't help. It might even make things slightly worse if the current bottleneck is disk IO during a stress period, as the compactors would increase their disk read throughput and thus maybe fight with the read requests for disk throughput.
>
> If you have a similar number of sstables on all nodes, not many compactions pending (nodetool compactionstats -H) and read operations are hitting a small number of sstables (nodetool tablehistograms), then you probably don't need to increase the compaction speed.
>
> Let's say that the compaction throughput is not often the cause of stress during peak hours, nor a direct way to make things 'faster'. Generally, when compaction goes wrong, the number of sstables goes *through* the roof. If you have a chart showing the number of sstables, you can see this really well.
>
> Of course, if you feel you are in this case, increasing the compaction throughput will definitely help if the cluster also has spare disk throughput.
>
> To check what's wrong, if you believe it's something different, here are some useful commands:
>
> - nodetool tpstats (check for pending/blocked/dropped threads there)
> - Check WARN and ERROR messages in the logs (i.e. grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log)
> - Check local latencies (nodetool tablestats / nodetool tablehistograms) and compare them to the client request latency. At the node level, reads should probably be single-digit milliseconds, rather close to 1 ms with SSDs, and writes most probably below the millisecond (it depends on the data size too, etc.).
> - Trace a query during this period and see what takes time (for example from 'cqlsh': 'TRACING ON; SELECT ...')
>
> You can also analyze the *Garbage Collection* activity. As Cassandra runs on the JVM, a badly tuned GC will induce long pauses. Depending on the workload, and I must say for most of the clusters I work on, the default tuning is not that good and can keep servers busy 10-15% of the time with stop-the-world GC.
> You might find this post by my colleague Jon about GC tuning for Apache Cassandra interesting: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. Reducing GC pressure is a very common way to optimize a Cassandra cluster, to adapt it to your workload/hardware.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>
>> Hi list,
>>
>> Cassandra 3.0.6
>>
>> I'd like to test the change of concurrent compactors to see if it helps when the system is under stress.
>>
>> Can someone point me to the right mbean?
>> I can not really find good docs about mbeans (or tools ...)
>>
>> Any suggestion much appreciated, best
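
For completeness, a minimal jmxterm session along the lines described above might look like this (a sketch only: the jar name is a placeholder, 7199 is the default Cassandra JMX port, the value 8 is just an example, and org.apache.cassandra.db:type=StorageService is where StorageServiceMBean is normally registered):

    # connect to the local node's JMX port (jar name is a placeholder)
    java -jar jmxterm-uber.jar -l localhost:7199
    $> bean org.apache.cassandra.db:type=StorageService
    $> get CompactionThroughputMbPerSec
    $> set CompactionThroughputMbPerSec 8
    $> quit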