> Ah excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.
Wow, I meant to say "I guided you through changing the compaction throughput when you wanted to change the number of concurrent compactors." I should not answer messages before waking up fully... :)

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Ah excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.
>
> I also found some commands I ran in the past using jmxterm. As mentioned by Chris - and thanks Chris for answering the question properly - the 'max' can never be lower than the 'core'.
>
> Use JMXTERM to REDUCE the concurrent compactors:
>
> ```
> # if we currently have more than 2 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Use JMXTERM to INCREASE the concurrent compactors:
>
> ```
> # if we currently have fewer than 6 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Some comments about the information you shared, as you said, 'thinking out loud' :):
>
> *About the hardware*
>
> I remember using the 'm1.xlarge' :). They are not that recent. It will probably be worth it to reconsider this hardware choice and migrate to newer hardware (m5/r4 + EBS GP2, or i3 with ephemeral storage). You should be able to reduce the number of nodes and make it cost-equivalent (or maybe slightly more expensive, but then it works properly). I once moved from a lot of these nodes (80ish) to a few i2 instances (5 - 15? I don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved later on). Also, using the right hardware for your case should save you and your team some headaches. I started with t1.micro in prod and went all the way up (m1.small, m1.medium, ...). It's good for learning, not for business.
>
> Especially, this does not work well together:
>
>> my instances are still on magnetic drives
>
> with
>
>> most tables on LCS
>> frequent r/w pattern
>
> Having some SSDs here (EBS GP2, or even better i3 - NVMe disks) would most probably help to reduce the latency. I would also pick an instance with more memory (30 GB would probably be more comfortable). The more memory you have, the better you can tune the JVM and the more page caching can be done (thus avoiding some disk reads). Given the number of nodes you use, it's hard to keep the cost low while doing this change. When the cluster grows you might want to consider changing the instance type again; for now maybe just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory and the same number of CPUs (or more), and see how many nodes are needed. It might be slightly more expensive, but I really believe it could do some good.
>
> As a middle-term solution, I think you might be really happy with a change of this kind.
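>
> By the way, to double-check the current values before running either of the two blocks above, jmxterm can also read the attributes. A quick sketch, assuming the same jar path and JMX port as in the commands above:
>
> ```
> # read-only: print the current core and maximum compactor thread counts
> echo "get -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads MaximumCompactorThreads" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Since it only reads the attributes, it is safe to run on any node at any time.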
>
> *About DTCS/TWCS?*
>
>> - few tables with DTCS
>> - need to upgrade to 3.0.8 for TWCS
>
> Indeed, switching from DTCS to TWCS can be a real relief for a cluster. You should not have to wait for the upgrade to 3.0.8 to use TWCS. I must say I am not too sure for 3.0.x (x < 8) versions though. Maybe giving http://thelastpickle.com/blog/2017/01/10/twcs-part2.html a try, with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0, is easier for you?
>
> *Garbage Collection?*
>
> That being said, the CPU load is really high, and I suspect Garbage Collection is taking a lot of the nodes' time on this cluster. It is probably not helping the CPUs either. This might even be the biggest pain point for this cluster.
>
> Would you like to try the following settings on a canary node and see how it goes? These settings are quite arbitrary; with the gc.log I could be more precise on what I believe is a correct setting.
>
> GC Type: CMS
> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
> New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
> TenuringThreshold: 15 (instead of 1, which is definitely too small and tends to let short-lived objects still be promoted to the old gen)
>
> For those settings, I do not trust the Cassandra defaults in most cases. New_heap_size should be 25-50% of the heap (and not related to the number of CPU cores). Also, below 16 GB I never had better results with G1GC than with CMS. But I must say I have been fighting a lot with CMS in the past to tune it nicely, while I did not even play much with G1GC.
>
> These (or similar) settings worked for distinct cases with heavy read patterns. On the mailing list I recently explained to someone else my understanding of the JVM and GC, and there is also a blog post my colleague Jon wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I believe he suggested a slightly different tuning.
> If none of this helps, please send the gc.log files over, with and without this change, and we can have a look at what is going on. SurvivorRatio can also be moved down to 2 or 4 if you want to play around and check the difference.
>
> Make sure to use a canary node first; there is no 'good' configuration here, it really depends on the workload, and the settings above could harm the cluster.
>
> I think we can make more of these instances. Nonetheless, after adding a few more nodes, scaling up the instance type instead of the number of nodes, to get SSDs and a bit more memory, will make things smoother, and probably cheaper as well at some point.
>
> 2018-07-18 17:27 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>
>> Chris,
>>
>> Thank you for the mbean reference.
>>
>> On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com> wrote:
>>
>>> Alain, thank you for your email. I really, really appreciate it!
>>>
>>> I am actually trying to remove disk IO from the suspect list, thus I want to reduce the number of concurrent compactors. I'll give throughput a shot.
>>> No, I don't have a long list of pending compactions, however my instances are still on magnetic drives and can't really afford a high number of compactors.
>>>
>>> We started to have slowdowns and most likely we were undersized; new features are coming in and I want to be ready for them.
>>>
>>> *About the issue:*
>>>
>>> - High system load on cassandra nodes. This means top saying 6.0/12.0 on a 4 vcpu instance (!)
>>> - CPU is high:
>>>   - Dynatrace says 50%
>>>   - top easily goes to 80%
>>> - Network around 30Mb (according to Dynatrace)
>>> - Disks:
>>>   - ~40 iops
>>>   - high latency: ~20ms (min 8, max 50!)
>>>   - negligible iowait
>>>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
>>> - Clients timeout
>>>   - mostly when reading
>>>   - few cases when writing
>>> - Slowly growing number of "All time blocked" for Native-Transport-Requests
>>>   - small numbers: hundreds vs millions of successfully served requests
>>>
>>> The system:
>>>
>>> - Cassandra 3.0.6
>>> - most tables on LCS
>>>   - frequent r/w pattern
>>> - few tables with DTCS
>>>   - need to upgrade to 3.0.8 for TWCS
>>>   - mostly TS data, stream write / batch read
>>> - All our keyspaces have RF: 3
>>> - All nodes on the same AZ
>>> - m1.xlarge
>>>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>>>   - 4 vcpu
>>>   - 15GB ram
>>> - workload:
>>>   - Java applications:
>>>     - mostly feeding cassandra, writing incoming data
>>>   - Apache Spark applications:
>>>     - batch processes to read and write back to C* or other systems
>>>     - not co-located
>>>
>>> So far my effort was put into growing the ring to better distribute the load and decrease the pressure, including:
>>>
>>> - Increasing the node number from 3 to 5 (6th node joining)
>>> - jvm memory "optimization":
>>>   - heaps were set by the default script to something a bit smaller than 4GB with CMS gc
>>>     - gc pressure was high / long gc pauses
>>>     - clients were suffering read timeouts
>>>   - increased the heap, still using CMS:
>>>     - very long GC pauses
>>>     - not much tuning around CMS
>>>   - switched to G1 and forced a 6/7GB heap on each node using almost the suggested settings
>>>     - much more stable
>>>     - generally < 300ms
>>>     - I still have long pauses from time to time (mostly around 1200ms, sometimes 3000 on some nodes)
>>>
>>> *Thinking out loud:*
>>> Things are much better, however I still see high cpu usage especially when Spark kicks in, even though the Spark jobs are very small in terms of resources (a single worker with very limited parallelism).
>>>
>>> On LCS tables cfstats reports single-digit read latencies and generally 0.X write latencies (as of today).
>>> On DTCS tables I have 0.x ms write latency but still double-digit read latency, but I guess I should spend some time tuning that, or upgrade and move away from DTCS :(
>>> Yes, Spark reads mostly from DTCS tables.
>>>
>>> It is still kinda common to have dropped READ, HINT and MUTATION messages.
>>>
>>> - not on all nodes
>>> - this generally happens on node restart
>>>
>>> On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed 14.04 and 16.04) with terrible results, much slower instance startup and responsiveness; how could that be?
>>>
>>> Once everything is stabilized I'll prepare our move to VPC and possibly upgrade to i3 instances. Any comment on the hardware side? Are 4 cores still a reasonable choice?
>>>
>>> Best,
>>>
>>> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>>> Hello Riccardo,
>>>>
>>>> I noticed I have been writing a novel to answer a simple couple of questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what you were looking for :). Also, there is a warning that it might be counter-productive and stress the cluster even more to increase the compaction throughput. There is more information below ('about the issue').
>>>>
>>>> *tl;dr*:
>>>>
>>>> What about using 'nodetool setcompactionthroughput XX' instead? It should be available there.
>>>>
>>>> In the same way, 'nodetool getcompactionthroughput' gives you the current value. Be aware that this change done through JMX/nodetool is *not* permanent. You still need to update the cassandra.yaml file.
>>>>
>>>> If you really want to use the MBean through JMX, because using 'nodetool' is too easy (or for any other reason :p):
>>>>
>>>> Mbean: org.apache.cassandra.service.StorageServiceMBean
>>>> Attribute: CompactionThroughputMbPerSec
>>>>
>>>> *Long story*, with the "how to", since I went through this search myself; I did not know where this MBean was.
>>>>
>>>>> Can someone point me to the right mbean?
>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>
>>>> I am not sure about the doc, but you can use jmxterm (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>>>
>>>> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole to find the mbeans locally:
>>>>
>>>> * Add loopback addresses for ccm (see the readme file)
>>>> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
>>>> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
>>>> * Explore MBeans, try to guess where this could be (and discover other funny stuff in there :)).
>>>>
>>>> I must admit I did not find it this way using C* 3.0.6 and jconsole. I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI CompactionThroughput' with this result: https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>>>
>>>> With this I could find the right MBean; the only code documentation that is always up to date is the code itself, I am afraid:
>>>>
>>>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java: public void setCompactionThroughputMbPerSec(int value);'
>>>>
>>>> Note that the research in the code also leads to nodetool ;-).
>>>>
>>>> I could finally find the MBean in jconsole too: https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will live).
>>>>
>>>> jconsole also allows you to see which attributes it is possible to set or not.
>>>>
>>>> You can now find any other MBean you would need, I hope :).
>>>>
>>>>> see if it helps when the system is under stress
>>>>
>>>> *About the issue*
>>>>
>>>> You don't exactly say what you are observing. What is that "stress"? How is it impacting the cluster?
>>>>
>>>> I ask because I am afraid this change might not help and could even be counter-productive. Even though having SSTables nicely compacted makes a huge difference at read time, if that's already the case for you and the data is already nicely compacted, doing this change won't help. It might even make things slightly worse if the current bottleneck is disk IO during a stress period, as the compactors would increase their disk read throughput, thus maybe fighting with the read requests for disk throughput.
>>>>
>>>> If you have a similar number of sstables on all nodes, not many compactions pending (nodetool compactionstats -H) and read operations are hitting a small number of sstables (nodetool tablehistograms), then you probably don't need to increase the compaction speed.
>>>>
>>>> Let's say that the compaction throughput is not often the cause of stress during peak hours, nor a direct way to make things 'faster'.
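>>>>
>>>> To make the checks above concrete, something along these lines on each node gives a quick picture (just a sketch; 'my_keyspace' and 'my_table' are placeholders to adapt):
>>>>
>>>> ```
>>>> # pending compactions on this node
>>>> nodetool compactionstats -H
>>>> # sstables hit per read and local latency distribution for one table
>>>> nodetool tablehistograms my_keyspace my_table
>>>> # current throughput cap, to compare before/after any change
>>>> nodetool getcompactionthroughput
>>>> ```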
>>>> Generally, when compaction goes wrong, the number of sstables goes *through* the roof. If you have a chart showing the number of sstables, you can see this really well.
>>>>
>>>> Of course, if you feel you are in this case, increasing the compaction throughput will definitely help, provided the cluster also has spare disk throughput.
>>>>
>>>> To check what's wrong, if you believe it's something different, here are some useful commands:
>>>>
>>>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>>>> - check WARN and ERROR entries in the logs (i.e. grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log)
>>>> - Check local latencies (nodetool tablestats / nodetool tablehistograms) and compare them to the client request latency. At the node level, reads should probably be single-digit milliseconds, rather close to 1 ms with SSDs, and writes below the millisecond most probably (it depends on the data size too, etc.).
>>>> - Trace a query during this period and see what takes time (for example from 'cqlsh': 'TRACING ON; SELECT ...')
>>>>
>>>> You can also analyze the *Garbage Collection* activity. As Cassandra uses the JVM, a badly tuned GC will induce long pauses. Depending on the workload, and I must say for most of the clusters I work on, the default tuning is not that good and can keep servers busy 10-15% of the time with stop-the-world GC.
>>>> You might find this post my colleague Jon wrote about GC tuning for Apache Cassandra interesting: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. GC tuning is a very common way to optimize a Cassandra cluster, to adapt it to your workload/hardware.
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>>
>>>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>>>
>>>>> Hi list,
>>>>>
>>>>> Cassandra 3.0.6
>>>>>
>>>>> I'd like to test changing the concurrent compactors to see if it helps when the system is under stress.
>>>>>
>>>>> Can someone point me to the right mbean?
>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>
>>>>> Any suggestion much appreciated, best
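
PS: for the canary node GC test suggested above, here is roughly what those settings could look like. This is only a sketch, assuming the usual cassandra-env.sh / jvm.options layout of a 3.0 install; values should be adjusted once we see the gc.log:

```
# conf/jvm.options (or the equivalent MAX_HEAP_SIZE / JVM_OPTS lines in cassandra-env.sh), canary node only
-Xms8G                         # 8 GB heap, min = max to avoid resizing
-Xmx8G
-Xmn2G                         # new gen: try 2 GB, then 4 GB
-XX:+UseParNewGC               # CMS rather than G1 for heaps under 16 GB
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=15    # instead of the default 1
-XX:SurvivorRatio=4            # optionally try 2 or 4 instead of the default 8
```

Again, one canary node first, and keep the gc.log from before and after the change so we can compare.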