Hi Anishek, In addition to the good advice others have given, do you notice any abnormally large partitions? What does cfhistograms report for 99% partition size? A few huge partitions will cause very disproportionate load on your cluster, including high GC.
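
For example, something along these lines (keyspace/table names are placeholders):

nodetool cfhistograms <keyspace> <table>
nodetool cfstats | grep -i 'compacted partition'   # compacted partition size stats per table
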
--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> wrote:

> Hi Anishek,
>
> We too faced a similar problem in 2.0.14, and after doing some research we configured a few parameters in cassandra.yaml and were able to overcome the GC pauses. Those are:
>
> · memtable_flush_writers: increased from 1 to 3, as the tpstats output showed dropped mutations, which means writes were getting blocked; increasing the number of writers helps absorb them.
>
> · memtable_total_space_in_mb: default is 1/4 of the heap size; it can be lowered, because large long-lived objects create pressure on the heap, so it is better to reduce it somewhat.
>
> · concurrent_compactors: Alain rightly pointed this out, i.e. reduce it to 8. You need to try this.
>
> Also please check whether you have mutations dropped on the other nodes or not.
>
> Hope this helps in your cluster too.
>
> Regards
> Amit Singh
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Wednesday, March 02, 2016 9:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Lot of GC on two nodes out of 7
>
> Can you post a gist of the output of jstat -gccause (60 seconds worth)? I think it's cool you're willing to experiment with alternative JVM settings, but I've never seen anyone use a max tenuring threshold of 50 either and I can't imagine it's helpful. Keep in mind that if your objects are actually reaching that threshold it means they've been copied 50x (really, really slow), and you're also going to end up spilling your eden objects directly into your old gen if your survivor is full. Considering the small amount of memory you're using for heap, I'm really not surprised you're running into problems.
>
> I recommend G1GC + 12GB heap and just letting it optimize itself for almost all cases with the latest JVM versions.
>
> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> It looks like you are doing good work with this cluster and know a lot about the JVM, that's good :-).
>
> Our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>
> That's good hardware too.
>
> With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try directly, on one of the 2 bad nodes.
>
> I would also probably try lowering to `HEAP_NEWSIZE=2G` and using `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the effects. But that's just an idea of something I would try in order to see the impact; I don't think it will solve your current issues, and it might even make things worse for this node.
>
> Using G1GC would allow you to use a bigger heap size. Using C* 2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing heap pressure that you might be interested in.
>
> I have spent time reading about all the other options before including them, and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.
>
> So, let's look for another reason.
>
> There are MUTATION and READ messages dropped in high numbers on the nodes in question, and on the other 5 nodes it varies between 1-3.
>
> - Is memory, CPU or disk a bottleneck? Is one of those running at the limits?
>
> concurrent_compactors: 48
>
> Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more so when compaction is not keeping up and compaction throughput is not throttled.
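>
> For instance, something like this (the throughput value below is just an illustration):
>
> # cassandra.yaml
> concurrent_compactors: 8
> compaction_throughput_mb_per_sec: 16
>
> # throughput can also be changed live, without a restart:
> nodetool setcompactionthroughput 16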
>
> Just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139
>
> Looks like `concurrent_compactors: 8` is the new default.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>
> Thanks a lot Alain for the details.
>
> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.
>
> We have a lot of reads and writes on the system, so we keep the new size high to make sure enough is held in memory, including caches / memtables etc. (number of flush_writers: 4 for us). Similarly, we keep less in the old generation to make sure we spend less time in CMS GC, since most of the data is transient in memory for us. We keep the tenuring threshold high because we don't want objects going to the old generation; they should just die in the young generation, given we have configured large survivor spaces.
>
> We use an occupancyFraction of 70 since:
> the new gen size is 4G (out of the 6G heap),
> survivor space is 400 MB (2 survivor spaces),
> 70% of 2G (the old generation) = 1.4G,
> so once we are just below 1.4G and we have to move the full survivor + some extra during a ParNew GC due to promotion failure, everything will still fit in the old generation, and CMS will be triggered.
>
> I have spent time reading about all the other options before including them, and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.
>
> tpstats on all machines shows the flush writer blocked at 0.3% of total.
>
> The two nodes in question have stats almost as below
> - specifically, there are pending tasks in ReadStage, MutationStage and RequestResponseStage
>
> Pool Name                    Active   Pending    Completed   Blocked  All time blocked
> ReadStage                        21        19   2141798645         0                 0
> RequestResponseStage              0         1    803242391         0                 0
> MutationStage                     0         0    291813703         0                 0
> ReadRepairStage                   0         0    200544344         0                 0
> ReplicateOnWriteStage             0         0            0         0                 0
> GossipStage                       0         0       292477         0                 0
> CacheCleanupExecutor              0         0            0         0                 0
> MigrationStage                    0         0            0         0                 0
> MemoryMeter                       0         0         2172         0                 0
> FlushWriter                       0         0         2756         0                 6
> ValidationExecutor                0         0          101         0                 0
> InternalResponseStage             0         0            0         0                 0
> AntiEntropyStage                  0         0          202         0                 0
> MemtablePostFlusher               0         0         4395         0                 0
> MiscStage                         0         0            0         0                 0
> PendingRangeCalculator            0         0           20         0                 0
> CompactionExecutor                4         4        49323         0                 0
> commitlog_archiver                0         0            0         0                 0
> HintedHandoff                     0         0          116         0                 0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                 36
> PAGED_RANGE                  0
> BINARY                       0
> READ                     11471
> MUTATION                   898
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
>
> All the other 5 nodes show no pending numbers.
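>
> For reference, a rough way to keep an eye on these counters over time (the 60-second interval and the grep pattern are arbitrary):
>
> while true; do date; nodetool tpstats | egrep 'ReadStage|MutationStage|RequestResponseStage|READ|MUTATION'; sleep 60; done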
>
> Our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
> compaction throughput is 0 MB/s
> concurrent_compactors: 48
> flush_writers: 4
>
> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.
>
> I think this is what you are looking for, please correct me if I am wrong:
> Compacted partition maximum bytes: 1629722
> Similar value on all 7 nodes.
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
> There are MUTATION and READ messages dropped in high numbers on the nodes in question, and on the other 5 nodes it varies between 1-3.
>
> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> Hi Anishek,
>
> Even if it highly depends on your workload, here are my thoughts:
>
> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G)
>
> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
>
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
>
> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.
>
> You also use a bunch of options I don't know about. If you are uncertain about them, you could try a default conf without the options you added, applying just the changes above on top of the default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information in a nice reference on this topic, Al Tobey's blog post about tuning 2.1. Go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
> FWIW, I also saw improvements in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.
>
> The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
>
> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.
>
> The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.
>
> I would probably look at this the reverse way: I imagine that the extra GC is a consequence of something going wrong on those nodes, as the JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues inducing extra pressure on the JVM, and I would try to tune the JVM / GC only once the system is healthy. So I have often seen high GC being a consequence rather than the root cause of an issue.
>
> To explore this possibility:
>
> Does this command show some dropped or blocked tasks? This would add pressure to the heap.
> nodetool tpstats
>
> Do you have errors in the logs? Always good to know when facing an issue.
> grep -i "ERROR" /var/log/cassandra/system.log
>
> How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or high GC pressure.
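>
> For example (a quick check; the yaml path below is just the common default and may differ on your install):
>
> nodetool compactionstats    # pending / active compactions
> grep -E 'concurrent_compactors|compaction_throughput_mb_per_sec' /etc/cassandra/cassandra.yaml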
>
> What are your disks / CPU? To help us give you sensible values to try.
>
> Is there some iowait? It could point to a bottleneck or bad hardware.
> iostat -mx 5 100
>
> ...
>
> Hope one of those will point you to an issue, but there are many more things you could check.
>
> Let us know how it goes,
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>
> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>
> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote:
>
> Hey Jeff,
>
> One of the nodes with high GC has 1400 SSTables, all other nodes have about 500-900 SSTables. The other node with high GC has 636 SSTables.
>
> The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
>
> The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.
>
> thanks
> anishek
>
> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>
> Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I'd be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?
>
> In cfstats, does anything stand out? Is the max row size on those nodes larger than on other nodes?
>
> What you don't show in your JVM options is the new gen size. If you do have unusually large partitions on those two nodes (especially likely if you have rf=2; if you have rf=3, then there's probably a third misbehaving node you haven't found yet), then raising the new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be "gigabytes" of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).
>
> - Jeff
>
> *From:* Anishek Agarwal
> *Reply-To:* "user@cassandra.apache.org"
> *Date:* Tuesday, March 1, 2016 at 11:12 PM
> *To:* "user@cassandra.apache.org"
> *Subject:* Lot of GC on two nodes out of 7
>
> Hello,
>
> We have a Cassandra cluster of 7 nodes, all of them have the same JVM GC configuration, and all our writes / reads use the TokenAware policy wrapping a DCAware policy. All nodes are part of the same datacenter.
>
> We are seeing that two nodes are having high GC collection times. They mostly seem to spend about 300-600 ms in GC. This also seems to result in higher CPU utilisation on these machines. The other 5 nodes don't have this problem.
>
> There is no additional repair activity going on in the cluster, and we are not sure why this is happening.
>
> We checked cfhistograms on the two CFs we have in the cluster and the number of reads seems to be almost the same.
>
> We also used cfstats to see the number of SSTables on each node, and one of the nodes with the above problem has twice as many SSTables as the other nodes. This still does not explain why two nodes have high GC overhead.
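>
> For reference, the GC pauses are also visible as GCInspector lines in the Cassandra log, so a quick way to compare nodes is something like:
> grep 'GCInspector' /var/log/cassandra/system.log | tail -n 20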
>
> Our GC config is as below:
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> # earlier value 131072 = 32768 * 4
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
> #new
> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>
> We are using Cassandra 2.0.17. If anyone has any suggestion as to what else we can look at to understand why this is happening, please do reply.
>
> Thanks
> anishek
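>
> PS: if useful, GC activity on a node can be sampled with something along these lines (the pid lookup and the one-second x 60 sampling are only an illustration):
> jstat -gccause $(pgrep -f CassandraDaemon) 1000 60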