It looks like you are doing good work with this cluster and know a lot about the JVM, that's good :-).
> our machine configurations are : 2 X 800 GB SSD, 48 cores, 64 GB RAM

That's good hardware too. With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try directly on one of the 2 bad nodes. I would also try lowering `HEAP_NEWSIZE` to 2G and using `-XX:MaxTenuringThreshold=15`, still on that canary node, to observe the effects. That's just an idea of something I would try to see the impact; I don't think it will solve your current issues, and it could even make things worse on this node. (There is a concrete sketch of these canary changes further down in this mail, after the quoted context.)

Using G1GC would allow you to use a bigger heap size. Using C* 2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing the heap pressure that you might be interested in.

> I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.

So, let's look for another reason.

> there are MUTATION and READ messages dropped in high number on the nodes in question and on the other 5 nodes it varies between 1-3.

- Is memory, CPU or disk a bottleneck? Is one of those running at its limits?

> concurrent_compactors: 48

Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more so when compaction is not keeping up and compaction throughput is not throttled. I just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139. It looks like `concurrent_compactors: 8` is the new default.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:

> Thanks a lot Alain for the details.

>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
>> `-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default, 75? Using the default and then tuning from there to improve things is generally a good idea.

> we have a lot of reads and writes onto the system, so we keep the new size high to make sure enough is held in memory, including caches / memtables etc. (number of flush_writers: 4 for us). Similarly, we keep less in the old generation to make sure we spend less time in CMS GC; most of the data is transient in memory for us. We keep a high TenuringThreshold because we don't want objects going to the old generation; they should just die in the young generation, given we have configured large survivor spaces.

> we use an occupancyFraction of 70 since:
> given new gen is 4G
> survivor space is: 400 MB -- 2 survivor spaces
> 70 % of 2G (old generation) = 1.4G

> so once we are just below 1.4G and we have to move the full survivor + some extra during a ParNew GC due to promotion failure, everything will fit in the old generation, and that will trigger CMS.

> I have spent time reading about all other options before including them and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.
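Coming back to the canary idea from the top of this mail: concretely, this is roughly what I would change, only on one of the 2 bad nodes, and revert if it does not help. This is just a sketch of the values discussed above, not something I have tested on your workload; file locations follow a stock 2.0 layout (conf/), so adjust to your packaging, and a node restart is needed for these changes to take effect.

# conf/cassandra-env.sh on the canary node
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
# in your existing JVM_OPTS list, lower the tenuring threshold from 50 to 15
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"

# conf/cassandra.yaml on the canary node
concurrent_compactors: 8

Then watch the GC graphs (gcviewer) and nodetool tpstats on that node for a while before touching the other nodes.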
> tp stats on all machines show flush writer blocked at: 0.3% of total.

> the two nodes in question have stats almost as below -- specifically there are pending tasks in ReadStage, MutationStage and RequestResponseStage:

> Pool Name                  Active   Pending     Completed   Blocked  All time blocked
> ReadStage                      21        19    2141798645         0                 0
> RequestResponseStage            0         1     803242391         0                 0
> MutationStage                   0         0     291813703         0                 0
> ReadRepairStage                 0         0     200544344         0                 0
> ReplicateOnWriteStage           0         0             0         0                 0
> GossipStage                     0         0        292477         0                 0
> CacheCleanupExecutor            0         0             0         0                 0
> MigrationStage                  0         0             0         0                 0
> MemoryMeter                     0         0          2172         0                 0
> FlushWriter                     0         0          2756         0                 6
> ValidationExecutor              0         0           101         0                 0
> InternalResponseStage           0         0             0         0                 0
> AntiEntropyStage                0         0           202         0                 0
> MemtablePostFlusher             0         0          4395         0                 0
> MiscStage                       0         0             0         0                 0
> PendingRangeCalculator          0         0            20         0                 0
> CompactionExecutor              4         4         49323         0                 0
> commitlog_archiver              0         0             0         0                 0
> HintedHandoff                   0         0           116         0                 0

> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                 36
> PAGED_RANGE                  0
> BINARY                       0
> READ                     11471
> MUTATION                   898
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0

> all the other 5 nodes show no pending numbers.

> our machine configurations are: 2 X 800 GB SSD, 48 cores, 64 GB RAM
> compaction throughput is 0 MB/s
> concurrent_compactors: 48
> flush_writers: 4

>> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.

> I think this is what you are looking for, please correct me if I am wrong:

> Compacted partition maximum bytes: 1629722
> (similar value on all 7 nodes)

> grep -i "ERROR" /var/log/cassandra/system.log

> there are MUTATION and READ messages dropped in high number on the nodes in question and on the other 5 nodes it varies between 1-3.

> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

>> Hi Anishek,

>> Even if it highly depends on your workload, here are my thoughts:

>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
>> `-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default, 75? Using the default and then tuning from there to improve things is generally a good idea.

>> You also use a bunch of options I don't know about. If you are uncertain about them, you could try a default conf without the options you added, with just the changes above applied to the default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information in a nice reference on this topic, Al Tobey's blog post about tuning 2.1 -- go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

>> FWIW, I also saw improvements in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.

>>> the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
>> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.

>>> the only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC is 5-7 times higher than on the other 5 nodes, but i think thats expected.

>> I would probably look at this the reverse way: I imagine that the extra GC is a consequence of something going wrong on those nodes, as JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues inducing extra pressure on the JVM. I would try to tune JVM / GC only once the system is healthy. I have often seen high GC being a consequence rather than the root cause of an issue.

>> To explore this possibility:

>> Does this command show some dropped or blocked tasks? This would add pressure to the heap.
>> nodetool tpstats

>> Do you have errors in the logs? Always good to know when facing an issue.
>> grep -i "ERROR" /var/log/cassandra/system.log

>> How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or high GC pressure.

>> What are your disks / CPU? This will help us give you good values to try.

>> Is there some iowait? It could point to a bottleneck or bad hardware.
>> iostat -mx 5 100

>> ...

>> Hope one of those will point you to an issue, but there are many more things you could check.

>> Let us know how it goes,

>> C*heers,
>> -----------------------
>> Alain Rodriguez - al...@thelastpickle.com
>> France

>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com

>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:

>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.

>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote:

>>>> Hey Jeff,

>>>> one of the nodes with high GC has 1400 SSTables, all other nodes have about 500-900 SSTables. The other node with high GC has 636 SSTables.

>>>> the average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.

>>>> the only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC is 5-7 times higher than on the other 5 nodes, but i think thats expected.

>>>> thanks
>>>> anishek

>>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

>>>>> Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I'd be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?

>>>>> In cfstats, does anything stand out? Is the max row size on those nodes larger than on other nodes?

>>>>> What you don't show in your JVM options is the new gen size -- if you do have unusually large partitions on those two nodes (especially likely if you have rf=2 -- if you have rf=3, then there's probably a third node misbehaving you haven't found yet), then raising the new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion.
>>>>> Estimates for the amount of garbage vary, but it could be "gigabytes" of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).

>>>>> - Jeff

>>>>> From: Anishek Agarwal
>>>>> Reply-To: "user@cassandra.apache.org"
>>>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>>>> To: "user@cassandra.apache.org"
>>>>> Subject: Lot of GC on two nodes out of 7

>>>>> Hello,

>>>>> we have a cassandra cluster of 7 nodes, all of them with the same JVM GC configuration. All our writes / reads use the TokenAware policy wrapping a DCAware policy. All nodes are part of the same datacenter.

>>>>> We are seeing that two nodes have high GC collection times. They mostly seem to spend about 300-600 ms in GC. This also seems to result in higher CPU utilisation on these machines. The other 5 nodes don't have this problem.

>>>>> There is no additional repair activity going on in the cluster, and we are not sure why this is happening. We checked cfhistograms on the two CFs we have in the cluster and the number of reads seems to be almost the same.

>>>>> We also used cfstats to see the number of SSTables on each node, and one of the nodes with the above problem has twice the number of SSTables as the other nodes. This still does not explain why two nodes have high GC overheads. Our GC config is as below:

>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>>> # earlier value 131072 = 32768 * 4
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>>> #new
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

>>>>> We are using cassandra 2.0.17. If anyone has any suggestion as to what else we can look at to understand why this is happening, please do reply.

>>>>> Thanks
>>>>> anishek
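PS: to keep the checks mentioned earlier in the thread in one place, here is roughly what I would run on one of the two bad nodes and compare with a healthy one. Adjust the log path to your install, `iostat` comes from the sysstat package, and `getcompactionthroughput` may or may not be present depending on your exact nodetool version.

nodetool tpstats                                  # dropped / blocked / pending tasks
nodetool compactionstats                          # pending compactions, is compaction keeping up?
nodetool getcompactionthroughput                  # current throttling (you reported 0 MB/s, i.e. unthrottled)
grep -i "ERROR" /var/log/cassandra/system.log     # errors in the logs
iostat -mx 5 100                                  # iowait / disk saturation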