Chris, thank you for the MBean reference.
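For anyone digging this thread out of the archives later: the jmxterm session for the
throughput attribute Alain describes further down ends up looking roughly like the sketch
below. The default JMX port and no authentication are assumed, the jar name depends on the
jmxterm version you download, and the object name is what I would expect on a stock
install; if it does not resolve, list the candidates with the 'beans' command.

    # rough jmxterm session sketch (assumes JMX on localhost:7199, no auth;
    # jar name varies by jmxterm version)
    java -jar jmxterm-1.0-alpha-4-uber.jar
    $> open localhost:7199
    $> bean org.apache.cassandra.db:type=StorageService
    $> get CompactionThroughputMbPerSec
    $> set CompactionThroughputMbPerSec 32
    $> get CompactionThroughputMbPerSec

As noted downthread, this does not survive a restart; cassandra.yaml still needs to be
updated for a permanent change.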
On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com> wrote:

> Alain, thank you for your email. I really, really appreciate it!
>
> I am actually trying to remove disk IO from the suspect list, thus I want
> to reduce the number of concurrent compactors. I'll give throughput a shot.
> No, I don't have a long list of pending compactions, however my instances
> are still on magnetic drives and can't really afford a high number of
> compactors.
>
> We started to have slowdowns and most likely we were undersized; new
> features are coming in and I want to be ready for them.
>
> *About the issue:*
>
> - High system load on Cassandra nodes. This means top saying 6.0/12.0 on
>   a 4 vcpu instance (!)
> - CPU is high:
>   - Dynatrace says 50%
>   - top easily goes to 80%
> - Network around 30Mb (according to Dynatrace)
> - Disks:
>   - ~40 iops
>   - high latency: ~20ms (min 8, max 50!)
>   - negligible iowait
>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
> - Client timeouts:
>   - mostly when reading
>   - few cases when writing
> - Slowly growing "All time blocked" count for Native-Transport-Requests:
>   - small numbers: hundreds vs millions of successfully served requests
>
> The system:
>
> - Cassandra 3.0.6
> - most tables on LCS
>   - frequent r/w pattern
> - few tables with DTCS
>   - need to upgrade to 3.0.8 for TWCS
>   - mostly TS data, stream write / batch read
> - all our keyspaces have RF: 3
> - all nodes in the same AZ
> - m1.xlarge:
>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>   - 4 vcpu
>   - 15GB ram
> - workload:
>   - Java applications:
>     - mostly feeding Cassandra, writing incoming data
>   - Apache Spark applications:
>     - batch processes that read from and write back to C* or other systems
>     - not co-located
>
> So far my effort was put into growing the ring to better distribute the
> load and decrease the pressure, including:
>
> - increasing the node count from 3 to 5 (6th node joining)
> - JVM memory "optimization":
>   - heaps were set by the default script to something a bit smaller than
>     4GB, with CMS GC
>   - GC pressure was high / long GC pauses
>   - clients were suffering from read timeouts
>   - increased the heap, still using CMS:
>     - very long GC pauses
>     - not much tuning around CMS
>   - switched to G1 and forced a 6/7GB heap on each node using (almost) the
>     suggested settings:
>     - much more stable
>     - generally < 300ms
>     - I still have long pauses from time to time (mostly around 1200ms,
>       sometimes 3000ms on some nodes)
>
> *Thinking out loud:*
> Things are much better, however I still see high CPU usage, especially
> when Spark kicks in, even though the Spark jobs are very small in terms of
> resources (a single worker with very limited parallelism).
>
> On LCS tables cfstats reports single-digit read latencies and generally
> 0.x ms write latencies (as of today).
> On DTCS tables I have 0.x ms write latency but still double-digit read
> latency; I guess I should spend some time tuning that, or upgrade and move
> away from DTCS :(
> Yes, Spark reads mostly from DTCS tables.
>
> It is still fairly common to see dropped READ, HINT and MUTATION messages:
>
> - not on all nodes
> - this generally happens on node restart
>
> On a side note, I tried to install libjemalloc1 from the Ubuntu repo
> (mixed 14.04 and 16.04) with terrible results: much slower instance
> startup and responsiveness. How could that be?
>
> Once everything is stabilized I'll prepare our move to VPC and possibly
> upgrade to i3 instances. Any comment on the hardware side? Are 4 cores
> still reasonable hardware?
>
> Best,
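(For reference, an fio baseline like the one mentioned above usually looks something
along the following lines; the flags, sizes and directory are illustrative, not the exact
command that was run.)

    # 4k random-read baseline on an idle instance (illustrative flags/paths)
    fio --name=randread-baseline --directory=/var/lib/cassandra/data \
        --ioengine=libaio --direct=1 --rw=randread --bs=4k --size=2G \
        --numjobs=4 --iodepth=16 --runtime=60 --time_based --group_reporting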
> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hello Riccardo,
>>
>> I noticed I have been writing a novel to answer a simple couple of
>> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what
>> you were looking for :). Also, there is a warning that it might be
>> counter-productive and stress the cluster even more to increase the
>> compaction throughput. There is more information below ('about the
>> issue').
>>
>> *tl;dr*:
>>
>> What about using 'nodetool setcompactionthroughput XX' instead? It
>> should be available there.
>>
>> In the same way, 'nodetool getcompactionthroughput' gives you the
>> current value. Be aware that this change done through JMX/nodetool is
>> *not* permanent. You still need to update the cassandra.yaml file.
>>
>> If you really want to use the MBean through JMX, because using
>> 'nodetool' is too easy (or for any other reason :p):
>>
>> MBean: org.apache.cassandra.service.StorageServiceMBean
>> Attribute: CompactionThroughputMbPerSec
>>
>> *Long story* with the "how to", since I went through this search myself
>> and did not know where this MBean was.
>>
>>> Can someone point me to the right mbean?
>>> I can not really find good docs about mbeans (or tools ...)
>>
>> I am not sure about the doc, but you can use jmxterm
>> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>
>> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole
>> to find the MBeans locally:
>>
>> * Add loopback addresses for ccm (see the readme file)
>> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
>> * Start jconsole using the right pid:
>>   'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
>> * Explore MBeans, try to guess where this could be (and discover other
>>   funny stuff in there :)).
>>
>> I must admit I did not find it this way using C* 3.0.6 and jconsole.
>> I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI
>> CompactionThroughput' with this result:
>> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>
>> With this I could find the right MBean. The only code documentation that
>> is always up to date is the code itself, I am afraid:
>>
>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:
>> public void setCompactionThroughputMbPerSec(int value);'
>>
>> Note that the research in the code also leads to nodetool ;-).
>>
>> I could finally find the MBean in jconsole too:
>> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will
>> live).
>>
>> jconsole also lets you see which attributes can be set and which cannot.
>>
>> You can now find any other MBean you would need, I hope :).
>>
>>> see if it helps when the system is under stress
>>
>> *About the issue*
>>
>> You don't exactly say what you are observing; what is that "stress"? How
>> is it impacting the cluster?
>>
>> I ask because I am afraid this change might not help and might even be
>> counter-productive. Even though having SSTables nicely compacted makes a
>> huge difference at read time, if that's already the case for you and the
>> data is already nicely compacted, this change won't help. It might even
>> make things slightly worse if the current bottleneck is disk IO during a
>> stress period, as the compactors would increase their disk read
>> throughput and thus maybe fight with the read requests for disk
>> throughput.
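(In practice the nodetool route above boils down to something like the following; 32 is
just an example value, and 16 MB/s is the cassandra.yaml default if I remember correctly.)

    # is compaction actually backed up on this node?
    nodetool compactionstats -H
    # read the current throttle, raise it temporarily, then put it back
    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 32
    nodetool setcompactionthroughput 16   # the change is lost on restart anyway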
>> If you have a similar number of SSTables on all nodes, not many
>> compactions pending ('nodetool compactionstats -H') and read operations
>> hitting a small number of SSTables ('nodetool tablehistograms'), then you
>> probably don't need to increase the compaction speed.
>>
>> Let's say that the compaction throughput is not often the cause of
>> stress during peak hours, nor a direct way to make things 'faster'.
>> Generally, when compaction goes wrong, the number of SSTables goes
>> *through* the roof. If you have a chart showing the number of SSTables,
>> you can see this really well.
>>
>> Of course, if you feel you are in this case, increasing the compaction
>> throughput will definitely help, provided the cluster also has spare
>> disk throughput.
>>
>> To check what's wrong, if you believe it's something different, here are
>> some useful commands:
>>
>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>> - check WARNs and ERRORs in the logs (i.e. grep -e "WARN" -e "ERROR"
>>   /var/log/cassandra/system.log)
>> - check local latencies (nodetool tablestats / nodetool tablehistograms)
>>   and compare them to the client request latency. At the node level,
>>   reads should probably be single-digit milliseconds, rather close to
>>   1 ms with SSDs, and writes below a millisecond most probably (it
>>   depends on the data size too, etc.)
>> - trace a query during this period and see what takes time (for example
>>   from 'cqlsh': 'TRACING ON; SELECT ...')
>>
>> You can also analyze the *garbage collection* activity. As Cassandra
>> uses the JVM, a badly tuned GC will induce long pauses. Depending on the
>> workload, and I must say for most of the clusters I work on, the default
>> tuning is not that good and can keep the server busy 10-15% of the time
>> with stop-the-world GC.
>> You might find this post by my colleague Jon about GC tuning for Apache
>> Cassandra interesting:
>> http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. GC pressure is
>> a very common way to optimize a Cassandra cluster, to adapt it to your
>> workload/hardware.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>
>>> Hi list,
>>>
>>> Cassandra 3.0.6
>>>
>>> I'd like to test changing the number of concurrent compactors to see if
>>> it helps when the system is under stress.
>>>
>>> Can someone point me to the right mbean?
>>> I can not really find good docs about mbeans (or tools ...)
>>>
>>> Any suggestion much appreciated, best
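(For completeness, the checks Alain lists above boil down to a quick sweep like the one
below; the keyspace/table names are placeholders and the log path assumes a standard
package install.)

    nodetool tpstats                                  # pending/blocked/dropped
    grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log | tail -n 100
    grep -i "GCInspector" /var/log/cassandra/system.log | tail -n 20
    nodetool tablestats my_keyspace.my_table          # local read/write latencies
    nodetool tablehistograms my_keyspace my_table     # sstables touched per read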