Without looking at your GC logs (you never posted a gist), my assumption would be that you're doing a lot of copying between survivor spaces, and those copies are taking a long time. You're probably also promoting a lot of data to your old gen as a result of having full-ish survivor spaces to begin with.
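If you can grab even a minute of data before changing anything, that's easy to confirm. Something like this would do it (a rough sketch -- the pgrep lookup is a placeholder for however you find the Cassandra pid in your environment):

# Survivor and old gen occupancy, sampled once a second for a minute.
# S0/S1 sitting near 100% and OU climbing on every ParNew run would confirm
# full survivor spaces and premature promotion.
jstat -gcutil $(pgrep -f CassandraDaemon) 1000 60

# Same sampling, tagged with the cause of each collection -- this is the
# output Jonathan asked for a gist of further down the thread.
jstat -gccause $(pgrep -f CassandraDaemon) 1000 60

Either of those pasted into a gist would make this a lot less speculative.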
On Thu, Mar 3, 2016 at 10:26 PM Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

> I'd personally have gone the other way – if you're seeing ParNew pauses,
> increasing the new gen instead of decreasing it should help objects get
> dropped (faster) rather than promoted to survivor/old gen (slower)?
>
> From: Anishek Agarwal
> Reply-To: "user@cassandra.apache.org"
> Date: Thursday, March 3, 2016 at 8:55 PM
> To: "user@cassandra.apache.org"
> Subject: Re: Lot of GC on two nodes out of 7
>
> Hello,
>
> Bryan, most of the partition sizes are under 45 KB.
>
> I have tried concurrent_compactors: 8 on one of the nodes, still no
> improvement. I have also tried MAX_HEAP_SIZE: 8G, no improvement.
>
> I will try the new heap size of 2G, though I am sure CMS will take longer
> then.
>
> Also, it doesn't look like I mentioned what type of GC was causing the
> problems: on both nodes it is ParNew GC that takes long on each run, and
> too many runs happen in succession.
>
> anishek
>
> On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> Hi Anishek,
>>
>> In addition to the good advice others have given, do you notice any
>> abnormally large partitions? What does cfhistograms report for the 99%
>> partition size? A few huge partitions will cause very disproportionate
>> load on your cluster, including high GC.
>>
>> --Bryan
>>
>> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> wrote:
>>
>>> Hi Anishek,
>>>
>>> We too faced a similar problem on 2.0.14, and after doing some research
>>> we changed a few parameters in cassandra.yaml and were able to overcome
>>> the GC pauses. Those are:
>>>
>>> · memtable_flush_writers: increased from 1 to 3. The tpstats output
>>> showed dropped mutations, which means writes were getting blocked, so
>>> increasing this number helps absorb them.
>>>
>>> · memtable_total_space_in_mb: defaults to 1/4 of the heap size; it can
>>> be lowered, because large long-lived objects put pressure on the heap,
>>> so it is better to reduce it somewhat.
>>>
>>> · concurrent_compactors: Alain rightly pointed this out, i.e. reduce it
>>> to 8. You need to try this.
>>>
>>> Also, please check whether you have dropped mutations on the other
>>> nodes or not.
>>>
>>> Hope this helps in your cluster too.
>>>
>>> Regards
>>> Amit Singh
>>>
>>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
>>> *Sent:* Wednesday, March 02, 2016 9:33 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Lot of GC on two nodes out of 7
>>>
>>> Can you post a gist of the output of jstat -gccause (60 seconds worth)?
>>> I think it's cool that you're willing to experiment with alternative JVM
>>> settings, but I've never seen anyone use a max tenuring threshold of 50
>>> either, and I can't imagine it's helpful. Keep in mind that if your
>>> objects are actually reaching that threshold, it means they've been
>>> copied 50 times (really, really slow), and you're also going to end up
>>> spilling your eden objects directly into your old gen if your survivor
>>> space is full. Considering the small amount of memory you're using for
>>> heap, I'm really not surprised you're running into problems.
>>>
>>> I recommend G1GC + a 12GB heap and just letting it optimize itself for
>>> almost all cases with the latest JVM versions.
>>>
>>> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>> It looks like you are doing good work with this cluster and know a lot
>>> about the JVM, that's good :-).
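To make Jonathan's G1 suggestion concrete, the change would live in cassandra-env.sh and look roughly like the lines below -- a sketch only, not something tested on this cluster; the 12G heap is his figure and the pause target is an arbitrary starting point:

# Sketch: switch to G1 in cassandra-env.sh. Remove -XX:+UseParNewGC,
# -XX:+UseConcMarkSweepGC and the other CMS-specific flags first, and do
# not set HEAP_NEWSIZE / -Xmn -- G1 sizes the young generation itself.
MAX_HEAP_SIZE="12G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"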
>>> our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>>>
>>> That's good hardware too.
>>>
>>> With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try
>>> directly, on one of the two bad nodes.
>>>
>>> I would also try lowering the new gen to `HEAP_NEWSIZE=2G` and using
>>> `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the
>>> effects. That's just an idea of something I would try to see the impact;
>>> I don't think it will solve your current issues, nor should it make
>>> things worse for this node.
>>>
>>> Using G1GC would allow you to use a bigger heap size. Using C* 2.1 would
>>> allow you to store the memtables off-heap. Those are two improvements
>>> that reduce heap pressure which you might be interested in.
>>>
>>> I have spent time reading about all other options before including
>>> them, and a similar configuration on our other prod cluster is showing
>>> good GC graphs via gcviewer.
>>>
>>> So, let's look for another reason.
>>>
>>> there are MUTATION and READ messages dropped in high numbers on the
>>> nodes in question, and on the other 5 nodes it varies between 1-3.
>>>
>>> Is memory, CPU or disk a bottleneck? Is one of those running at its
>>> limits?
>>>
>>> concurrent_compactors: 48
>>>
>>> Reducing this to 8 would free some capacity for transactions (read &
>>> write requests). It is probably worth a try, all the more so when
>>> compaction is not keeping up and compaction throughput is not throttled.
>>>
>>> Just found an issue about that:
>>> https://issues.apache.org/jira/browse/CASSANDRA-7139
>>>
>>> Looks like `concurrent_compactors: 8` is the new default.
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>
>>> Thanks a lot Alain for the details.
>>>
>>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>>> might want to keep this as is, or even reduce it, if you have less than
>>> 16 GB of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give
>>> 15 a try.
>>> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default, 75?
>>> Using the default and then tuning from there to improve things is
>>> generally a good idea.
>>>
>>> We have a lot of reads and writes on the system, so we keep the new size
>>> high to make sure enough is held in memory, including caches / memtables
>>> etc. (number of flush_writers: 4 for us). Similarly, we keep the old
>>> generation small to make sure we spend less time in CMS GC, since most
>>> of our data is transient in memory. We keep the TenuringThreshold high
>>> because we don't want objects going to the old generation; they should
>>> just die in the young generation, given we have configured large
>>> survivor spaces.
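For anyone following the numbers in the next paragraph, this is where they come from -- just the standard HotSpot generation split, using the MAX_HEAP_SIZE=6G / HEAP_NEWSIZE=4G / SurvivorRatio=8 settings quoted later in the thread:

# HEAP_NEWSIZE=4G with -XX:SurvivorRatio=8 splits the young gen 8:1:1
# eden                = 4G * 8/10 = 3.2G
# each survivor space = 4G * 1/10 = ~400 MB  (two survivor spaces)
# old gen             = MAX_HEAP_SIZE - HEAP_NEWSIZE = 6G - 4G = 2G
# CMS trigger point   = 70% of 2G = ~1.4G  (-XX:CMSInitiatingOccupancyFraction=70)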
>>> using an occupancy fraction of 70 since:
>>>
>>> given the heap is 4G
>>> survivor space is: 400 MB -- 2 survivor spaces
>>> 70% of 2G (old generation) = 1.4G
>>>
>>> so once we are just below 1.4G and we have to move the full survivor
>>> space plus some extra during a ParNew GC due to promotion failure,
>>> everything will still fit in the old generation, and CMS will be
>>> triggered.
>>>
>>> I have spent time reading about all other options before including
>>> them, and a similar configuration on our other prod cluster is showing
>>> good GC graphs via gcviewer.
>>>
>>> tpstats on all machines shows the flush writer blocked at 0.3% of total.
>>>
>>> The two nodes in question have stats almost as below -- specifically
>>> there are pending tasks in ReadStage, MutationStage and
>>> RequestResponseStage.
>>>
>>> Pool Name                 Active  Pending   Completed  Blocked  All time blocked
>>> ReadStage                     21       19  2141798645        0                 0
>>> RequestResponseStage           0        1   803242391        0                 0
>>> MutationStage                  0        0   291813703        0                 0
>>> ReadRepairStage                0        0   200544344        0                 0
>>> ReplicateOnWriteStage          0        0           0        0                 0
>>> GossipStage                    0        0      292477        0                 0
>>> CacheCleanupExecutor           0        0           0        0                 0
>>> MigrationStage                 0        0           0        0                 0
>>> MemoryMeter                    0        0        2172        0                 0
>>> FlushWriter                    0        0        2756        0                 6
>>> ValidationExecutor             0        0         101        0                 0
>>> InternalResponseStage          0        0           0        0                 0
>>> AntiEntropyStage               0        0         202        0                 0
>>> MemtablePostFlusher            0        0        4395        0                 0
>>> MiscStage                      0        0           0        0                 0
>>> PendingRangeCalculator         0        0          20        0                 0
>>> CompactionExecutor             4        4       49323        0                 0
>>> commitlog_archiver             0        0           0        0                 0
>>> HintedHandoff                  0        0         116        0                 0
>>>
>>> Message type      Dropped
>>> RANGE_SLICE             0
>>> READ_REPAIR            36
>>> PAGED_RANGE             0
>>> BINARY                  0
>>> READ                11471
>>> MUTATION              898
>>> _TRACE                  0
>>> REQUEST_RESPONSE        0
>>> COUNTER_MUTATION        0
>>>
>>> All the other 5 nodes show no pending numbers.
>>>
>>> our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>>> compaction throughput is 0 MB/s
>>> concurrent_compactors: 48
>>> flush_writers: 4
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to the others is
>>> more relevant than the average size for this check.
>>>
>>> I think this is what you are looking for, please correct me if I am
>>> wrong:
>>> Compacted partition maximum bytes: 1629722
>>> similar value on all 7 nodes.
>>>
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>> there are MUTATION and READ messages dropped in high numbers on the
>>> nodes in question, and on the other 5 nodes it varies between 1-3.
>>>
>>> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>> Hi Anishek,
>>>
>>> Even if it highly depends on your workload, here are my thoughts:
>>>
>>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>>> might want to keep this as is, or even reduce it, if you have less than
>>> 16 GB of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give
>>> 15 a try.
>>> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default, 75?
>>> Using the default and then tuning from there to improve things is
>>> generally a good idea.
>>>
>>> You also use a bunch of options I don't know about. If you are uncertain
>>> about them, you could try a default conf without the options you added,
>>> with just the changes above applied on top of the default
>>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>>> Or you might find more useful information in a nice reference on this
>>> topic, Al Tobey's blog post about tuning 2.1 -- go to the 'Java Virtual
>>> Machine' part:
>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>>
>>> FWIW, I also saw improvements in the past by upgrading to 2.1, Java 8
>>> and G1GC. G1GC is supposed to be easier to configure too.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two
>>> nodes.
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to the others is
>>> more relevant than the average size for this check.
>>>
>>> the only other thing that stands out in cfstats is that the read time
>>> and write time on the nodes with high GC are 5-7 times higher than on
>>> the other 5 nodes, but I think that's expected.
>>>
>>> I would probably look at this the reverse way: I imagine the extra GC is
>>> a consequence of something going wrong on those nodes, as the JVM / GC
>>> are configured the same way cluster-wide. GC / JVM issues are often due
>>> to Cassandra / system / hardware issues inducing extra pressure on the
>>> JVM. I would try to tune the JVM / GC only once the system is healthy. I
>>> have often seen high GC being a consequence rather than the root cause
>>> of an issue.
>>>
>>> To explore this possibility:
>>>
>>> Does this command show some dropped or blocked tasks? That would add
>>> pressure to the heap.
>>> nodetool tpstats
>>>
>>> Do you have errors in the logs? Always good to know when facing an
>>> issue.
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>> How are compactions tuned (throughput + concurrent compactors)? This
>>> tuning might explain compactions not keeping up or high GC pressure.
>>>
>>> What are your disks / CPU? To help us give you good values to try.
>>>
>>> Is there some iowait? It could point to a bottleneck or bad hardware.
>>> iostat -mx 5 100
>>>
>>> ...
>>>
>>> Hope one of those will point you to an issue, but there are many more
>>> things you could check.
>>>
>>> Let us know how it goes,
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>
>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>
>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote:
>>>
>>> Hey Jeff,
>>>
>>> one of the nodes with high GC has 1400 SSTables; all other nodes have
>>> about 500-900 SSTables. The other node with high GC has 636 SSTables.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two
>>> nodes.
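A quick way to compare those two figures across the ring is something like the line below (run on each node; just a sketch, with the grep strings taken from the cfstats output quoted in this thread):

# A big skew on the two bad nodes in either number would support the
# wide-partition / compaction-backlog theories discussed above.
nodetool cfstats | grep -E 'SSTable count|Compacted partition maximum bytes'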
>>> the only other thing that stands out in cfstats is that the read time
>>> and write time on the nodes with high GC are 5-7 times higher than on
>>> the other 5 nodes, but I think that's expected.
>>>
>>> thanks
>>> anishek
>>>
>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>
>>> Compaction falling behind will likely cause additional work on reads
>>> (more sstables to merge), but I'd be surprised if it manifested in super
>>> long GC. When you say twice as many sstables, how many is that?
>>>
>>> In cfstats, does anything stand out? Is the max row size on those nodes
>>> larger than on the other nodes?
>>>
>>> What you don't show in your JVM options is the new gen size – if you do
>>> have unusually large partitions on those two nodes (especially likely if
>>> you have rf=2 – if you have rf=3, then there's probably a third
>>> misbehaving node you haven't found yet), then raising the new gen size
>>> can help handle the garbage created by reading large partitions without
>>> having to tolerate the promotion. Estimates for the amount of garbage
>>> vary, but it could be "gigabytes" of garbage on a very wide partition
>>> (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>> progress to help mitigate that type of pain).
>>>
>>> - Jeff
>>>
>>> *From: *Anishek Agarwal
>>> *Reply-To: *"user@cassandra.apache.org"
>>> *Date: *Tuesday, March 1, 2016 at 11:12 PM
>>> *To: *"user@cassandra.apache.org"
>>> *Subject: *Lot of GC on two nodes out of 7
>>>
>>> Hello,
>>>
>>> we have a Cassandra cluster of 7 nodes, all of them with the same JVM GC
>>> configuration. All our writes / reads use the TokenAware policy wrapping
>>> a DCAware policy. All nodes are part of the same datacenter.
>>>
>>> We are seeing that two nodes have high GC collection times. They mostly
>>> seem to spend time in GC, around 300-600 ms per run. This also seems to
>>> result in higher CPU utilisation on these machines. The other 5 nodes
>>> don't have this problem.
>>>
>>> There is no additional repair activity going on in the cluster, and we
>>> are not sure why this is happening.
>>>
>>> We checked cfhistograms on the two CFs we have in the cluster and the
>>> number of reads seems to be almost the same.
>>>
>>> We also used cfstats to see the number of sstables on each node, and one
>>> of the nodes with the above problem has twice the number of sstables as
>>> the other nodes. This still does not explain why two nodes have high GC
>>> overheads.
>>> our GC config is as below:
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>> # earlier value 131072 = 32768 * 4
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>> # new
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>
>>> We are using Cassandra 2.0.17. If anyone has any suggestion as to what
>>> else we can look at to understand why this is happening, please do
>>> reply.
>>>
>>> Thanks
>>> anishek
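Pulling the concrete suggestions from the thread into one place, the canary-node experiment would look roughly like this -- a sketch with the values proposed above, not something verified here:

# cassandra-env.sh on one canary node (change the existing values rather
# than adding duplicate flags):
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"

# cassandra.yaml: concurrent_compactors: 8 (the newer default). Also throttle
# compaction instead of leaving it at 0 (unlimited); adjustable at runtime:
nodetool setcompactionthroughput 16

Watching the same jstat output before and after on that one node should make it obvious whether promotion pressure actually drops.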