> Ah excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.
Wow, I meant to say "I guided you through changing the compaction throughput when you wanted to change the number of concurrent compactors." I should not answer messages before waking up fully... :)

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Ah excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.
>
> I also found some commands I ran in the past using jmxterm. As mentioned by Chris - and thanks Chris for answering the question properly - the 'max' can never be lower than the 'core'.
>
> Use JMXTERM to REDUCE the concurrent compactors:
>
> ```
> # if we currently have more than 2 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Use JMXTERM to INCREASE the concurrent compactors:
>
> ```
> # if we currently have fewer than 6 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Some comments about the information you shared, as you said, 'thinking out loud' :):
>
> *About the hardware*
>
> I remember using the 'm1.xlarge' :). They are not that recent. It will probably be worth it to reconsider this hardware choice and migrate to newer hardware (m5/r4 + EBS GP2, or i3 with ephemeral storage). You should be able to reduce the number of nodes and make it cost-equivalent (or maybe slightly more expensive, but then it works properly). I once moved from a lot of these nodes (80ish) to a few i2 instances (5 - 15? I don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved later on). Also, using the right hardware for your case should save you and your team some headaches. I started with t1.micro in prod and went all the way up (m1.small, m1.medium, ...). It's good for learning, not for business.
>
> Especially, this does not work well together:
>
>> my instances are still on magnetic drives
>
> with
>
>> most tables on LCS
>> frequent r/w pattern
>
> Having some SSDs here (EBS GP2, or even better i3 - NVMe disks) would most probably help to reduce the latency. I would also pick an instance with more memory (30 GB would probably be more comfortable). The more memory you have, the better you can tune the JVM and the more page caching can be done (thus avoiding some disk reads). Given the number of nodes you use, it's hard to keep the cost low while doing this change. When the cluster grows you might want to consider changing the instance type again; for now maybe just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory and the same number of CPUs (or more), and see how many nodes are needed. It might be slightly more expensive, but I really believe it could do some good.
>
> As a middle-term solution, I think you might be really happy with a change of this kind.
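>
> By the way, to double-check the current values before running either of the two blocks above, jmxterm can also read the attributes. A quick sketch, assuming the same jar path and JMX port as in the commands above:
>
> ```
> # read-only: print the current core and maximum compactor thread counts
> echo "get -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads MaximumCompactorThreads" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Since it only reads the attributes, it is safe to run on any node at any time.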
>
> *About DTCS/TWCS?*
>
>> - few tables with DTCS
>> - need to upgrade to 3.0.8 for TWCS
>
> Indeed, switching from DTCS to TWCS can be a real relief for a cluster. You should not have to wait for the upgrade to 3.0.8 to use TWCS. I must say I am not too sure for 3.0.x (x < 8) versions though. Maybe giving http://thelastpickle.com/blog/2017/01/10/twcs-part2.html a try, with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0, is easier for you?
>
> *Garbage Collection?*
>
> That being said, the CPU load is really high, and I suspect Garbage Collection is taking a lot of the nodes' time on this cluster. It is probably not helping the CPUs either. This might even be the biggest pain point for this cluster.
>
> Would you like to try the following settings on a canary node and see how it goes? These settings are quite arbitrary; with the gc.log I could be more precise on what I believe is a correct setting.
>
> GC Type: CMS
> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
> New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
> TenuringThreshold: 15 (instead of 1, which is definitely too small and tends to let short-lived objects still be promoted to the old gen)
>
> For those settings, I do not trust the Cassandra defaults in most cases. New_heap_size should be 25-50% of the heap (and not related to the number of CPU cores). Also, below 16 GB I never had better results with G1GC than with CMS. But I must say I have been fighting a lot with CMS in the past to tune it nicely, while I did not even play much with G1GC.
>
> These (or similar) settings worked for distinct cases with heavy read patterns. On the mailing list I recently explained to someone else my understanding of the JVM and GC, and there is also a blog post my colleague Jon wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I believe he suggested a slightly different tuning.
> If none of this helps, please send the gc.log files over, with and without this change, and we can have a look at what is going on. SurvivorRatio can also be moved down to 2 or 4 if you want to play around and check the difference.
>
> Make sure to use a canary node first; there is no 'good' configuration here, it really depends on the workload, and the settings above could harm the cluster.
>
> I think we can make more of these instances. Nonetheless, after adding a few more nodes, scaling up the instance type instead of the number of nodes, to get SSDs and a bit more memory, will make things smoother, and probably cheaper as well at some point.
>
> 2018-07-18 17:27 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>
>> Chris,
>>
>> Thank you for the mbean reference.
>>
>> On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com> wrote:
>>
>>> Alain, thank you for your email. I really, really appreciate it!
>>>
>>> I am actually trying to remove disk IO from the suspect list, thus I want to reduce the number of concurrent compactors. I'll give throughput a shot.
>>> No, I don't have a long list of pending compactions, however my instances are still on magnetic drives and can't really afford a high number of compactors.
>>>
>>> We started to have slowdowns and most likely we were undersized; new features are coming in and I want to be ready for them.
>>>
>>> *About the issue:*
>>>
>>> - High system load on cassandra nodes. This means top saying 6.0/12.0 on a 4 vcpu instance (!)
>>> - CPU is high:
>>>   - Dynatrace says 50%
>>>   - top easily goes to 80%
>>> - Network around 30Mb (according to Dynatrace)
>>> - Disks:
>>>   - ~40 iops
>>>   - high latency: ~20ms (min 8, max 50!)
>>>   - negligible iowait
>>>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
>>> - Clients timeout
>>>   - mostly when reading
>>>   - few cases when writing
>>> - Slowly growing number of "All time blocked" for Native-Transport-Requests
>>>   - small numbers: hundreds vs millions of successfully served requests
>>>
>>> The system:
>>>
>>> - Cassandra 3.0.6
>>> - most tables on LCS
>>>   - frequent r/w pattern
>>> - few tables with DTCS
>>>   - need to upgrade to 3.0.8 for TWCS
>>>   - mostly TS data, stream write / batch read
>>> - All our keyspaces have RF: 3
>>> - All nodes on the same AZ
>>> - m1.xlarge
>>>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>>>   - 4 vcpu
>>>   - 15GB ram
>>> - workload:
>>>   - Java applications:
>>>     - mostly feeding cassandra, writing incoming data
>>>   - Apache Spark applications:
>>>     - batch processes to read and write back to C* or other systems
>>>     - not co-located
>>>
>>> So far my effort was put into growing the ring to better distribute the load and decrease the pressure, including:
>>>
>>> - Increasing the node number from 3 to 5 (6th node joining)
>>> - jvm memory "optimization":
>>>   - heaps were set by the default script to something a bit smaller than 4GB with CMS gc
>>>     - gc pressure was high / long gc pauses
>>>     - clients were suffering read timeouts
>>>   - increased the heap, still using CMS:
>>>     - very long GC pauses
>>>     - not much tuning around CMS
>>>   - switched to G1 and forced a 6/7GB heap on each node using almost the suggested settings
>>>     - much more stable
>>>     - generally < 300ms
>>>     - I still have long pauses from time to time (mostly around 1200ms, sometimes 3000 on some nodes)
>>>
>>> *Thinking out loud:*
>>> Things are much better, however I still see high cpu usage especially when Spark kicks in, even though the Spark jobs are very small in terms of resources (a single worker with very limited parallelism).
>>>
>>> On LCS tables cfstats reports single-digit read latencies and generally 0.X write latencies (as of today).
>>> On DTCS tables I have 0.x ms write latency but still double-digit read latency, but I guess I should spend some time tuning that, or upgrade and move away from DTCS :(
>>> Yes, Spark reads mostly from DTCS tables.
>>>
>>> It is still kinda common to have dropped READ, HINT and MUTATION messages.
>>>
>>> - not on all nodes
>>> - this generally happens on node restart
>>>
>>> On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed 14.04 and 16.04) with terrible results, much slower instance startup and responsiveness; how could that be?
>>>
>>> Once everything is stabilized I'll prepare our move to VPC and possibly upgrade to i3 instances. Any comment on the hardware side? Are 4 cores still a reasonable choice?
>>>
>>> Best,
>>>
>>> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>>> Hello Riccardo,
>>>>
>>>> I noticed I have been writing a novel to answer a simple couple of questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what you were looking for :). Also, there is a warning that it might be counter-productive and stress the cluster even more to increase the compaction throughput. There is more information below ('about the issue').
>>>>
>>>> *tl;dr*:
>>>>
>>>> What about using 'nodetool setcompactionthroughput XX' instead? It should be available there.
>>>>
>>>> In the same way, 'nodetool getcompactionthroughput' gives you the current value. Be aware that this change done through JMX/nodetool is *not* permanent. You still need to update the cassandra.yaml file.
>>>>
>>>> If you really want to use the MBean through JMX, because using 'nodetool' is too easy (or for any other reason :p):
>>>>
>>>> Mbean: org.apache.cassandra.service.StorageServiceMBean
>>>> Attribute: CompactionThroughputMbPerSec
>>>>
>>>> *Long story*, with the "how to", since I went through this search myself; I did not know where this MBean was.
>>>>
>>>>> Can someone point me to the right mbean?
>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>
>>>> I am not sure about the doc, but you can use jmxterm (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>>>
>>>> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole to find the mbeans locally:
>>>>
>>>> * Add loopback addresses for ccm (see the readme file)
>>>> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
>>>> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
>>>> * Explore MBeans, try to guess where this could be (and discover other funny stuff in there :)).
>>>>
>>>> I must admit I did not find it this way using C* 3.0.6 and jconsole. I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI CompactionThroughput' with this result: https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>>>
>>>> With this I could find the right MBean; the only code documentation that is always up to date is the code itself, I am afraid:
>>>>
>>>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java: public void setCompactionThroughputMbPerSec(int value);'
>>>>
>>>> Note that the research in the code also leads to nodetool ;-).
>>>>
>>>> I could finally find the MBean in jconsole too: https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will live).
>>>>
>>>> jconsole also allows you to see which attributes it is possible to set or not.
>>>>
>>>> You can now find any other MBean you would need, I hope :).
>>>>
>>>>> see if it helps when the system is under stress
>>>>
>>>> *About the issue*
>>>>
>>>> You don't exactly say what you are observing. What is that "stress"? How is it impacting the cluster?
>>>>
>>>> I ask because I am afraid this change might not help and could even be counter-productive. Even though having SSTables nicely compacted makes a huge difference at read time, if that's already the case for you and the data is already nicely compacted, doing this change won't help. It might even make things slightly worse if the current bottleneck is disk IO during a stress period, as the compactors would increase their disk read throughput, thus maybe fighting with the read requests for disk throughput.
>>>>
>>>> If you have a similar number of sstables on all nodes, not many compactions pending (nodetool compactionstats -H) and read operations are hitting a small number of sstables (nodetool tablehistograms), then you probably don't need to increase the compaction speed.
>>>>
>>>> Let's say that the compaction throughput is not often the cause of stress during peak hours, nor a direct way to make things 'faster'.
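>>>>
>>>> To make the checks above concrete, something along these lines on each node gives a quick picture (just a sketch; 'my_keyspace' and 'my_table' are placeholders to adapt):
>>>>
>>>> ```
>>>> # pending compactions on this node
>>>> nodetool compactionstats -H
>>>> # sstables hit per read and local latency distribution for one table
>>>> nodetool tablehistograms my_keyspace my_table
>>>> # current throughput cap, to compare before/after any change
>>>> nodetool getcompactionthroughput
>>>> ```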
>>>> Generally, when compaction goes wrong, the number of sstables goes *through* the roof. If you have a chart showing the number of sstables, you can see this really well.
>>>>
>>>> Of course, if you feel you are in this case, increasing the compaction throughput will definitely help, provided the cluster also has spare disk throughput.
>>>>
>>>> To check what's wrong, if you believe it's something different, here are some useful commands:
>>>>
>>>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>>>> - check WARN and ERROR entries in the logs (i.e. grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log)
>>>> - Check local latencies (nodetool tablestats / nodetool tablehistograms) and compare them to the client request latency. At the node level, reads should probably be single-digit milliseconds, rather close to 1 ms with SSDs, and writes below the millisecond most probably (it depends on the data size too, etc.).
>>>> - Trace a query during this period and see what takes time (for example from 'cqlsh': 'TRACING ON; SELECT ...')
>>>>
>>>> You can also analyze the *Garbage Collection* activity. As Cassandra uses the JVM, a badly tuned GC will induce long pauses. Depending on the workload, and I must say for most of the clusters I work on, the default tuning is not that good and can keep servers busy 10-15% of the time with stop-the-world GC.
>>>> You might find this post my colleague Jon wrote about GC tuning for Apache Cassandra interesting: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. GC tuning is a very common way to optimize a Cassandra cluster, to adapt it to your workload/hardware.
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>>
>>>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>>>
>>>>> Hi list,
>>>>>
>>>>> Cassandra 3.0.6
>>>>>
>>>>> I'd like to test changing the concurrent compactors to see if it helps when the system is under stress.
>>>>>
>>>>> Can someone point me to the right mbean?
>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>
>>>>> Any suggestion much appreciated, best
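
PS: for the canary node GC test suggested above, here is roughly what those settings could look like. This is only a sketch, assuming the usual cassandra-env.sh / jvm.options layout of a 3.0 install; values should be adjusted once we see the gc.log:

```
# conf/jvm.options (or the equivalent MAX_HEAP_SIZE / JVM_OPTS lines in cassandra-env.sh), canary node only
-Xms8G                         # 8 GB heap, min = max to avoid resizing
-Xmx8G
-Xmn2G                         # new gen: try 2 GB, then 4 GB
-XX:+UseParNewGC               # CMS rather than G1 for heaps under 16 GB
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=15    # instead of the default 1
-XX:SurvivorRatio=4            # optionally try 2 or 4 instead of the default 8
```

Again, one canary node first, and keep the gc.log from before and after the change so we can compare.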