Without looking at your GC logs (you never posted a gist), my assumption would be that you're doing a lot of copying between survivor spaces, and those copies are taking a long time. You're probably also promoting a lot of data to your old gen as a result of having full-ish survivor spaces to begin with.
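If you can grab even a minute of data before changing anything, that's easy to confirm. Something like this would do it (a rough sketch -- the pgrep lookup is a placeholder for however you find the Cassandra pid in your environment):

# Survivor and old gen occupancy, sampled once a second for a minute.
# S0/S1 sitting near 100% and OU climbing on every ParNew run would confirm
# full survivor spaces and premature promotion.
jstat -gcutil $(pgrep -f CassandraDaemon) 1000 60

# Same sampling, tagged with the cause of each collection -- this is the
# output Jonathan asked for a gist of further down the thread.
jstat -gccause $(pgrep -f CassandraDaemon) 1000 60

Either of those pasted into a gist would make this a lot less speculative.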
On Thu, Mar 3, 2016 at 10:26 PM Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:

> I'd personally have gone the other way – if you're seeing ParNew pauses,
> increasing the new gen instead of decreasing it should help objects get
> dropped (faster) rather than promoted to survivor/old gen (slower)?
>
> From: Anishek Agarwal
> Reply-To: "user@cassandra.apache.org"
> Date: Thursday, March 3, 2016 at 8:55 PM
> To: "user@cassandra.apache.org"
> Subject: Re: Lot of GC on two nodes out of 7
>
> Hello,
>
> Bryan, most of the partition sizes are under 45 KB.
>
> I have tried concurrent_compactors: 8 on one of the nodes, still no
> improvement. I have also tried MAX_HEAP_SIZE: 8G, no improvement.
>
> I will try the new heap size of 2G, though I am sure CMS will take longer
> then.
>
> Also, it doesn't look like I mentioned what type of GC was causing the
> problems: on both nodes it is ParNew GC that takes long on each run, and
> too many runs happen in succession.
>
> anishek
>
> On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> Hi Anishek,
>>
>> In addition to the good advice others have given, do you notice any
>> abnormally large partitions? What does cfhistograms report for the 99%
>> partition size? A few huge partitions will cause very disproportionate
>> load on your cluster, including high GC.
>>
>> --Bryan
>>
>> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> wrote:
>>
>>> Hi Anishek,
>>>
>>> We too faced a similar problem on 2.0.14, and after doing some research
>>> we changed a few parameters in cassandra.yaml and were able to overcome
>>> the GC pauses. Those are:
>>>
>>> · memtable_flush_writers: increased from 1 to 3. The tpstats output
>>> showed dropped mutations, which means writes were getting blocked, so
>>> increasing this number helps absorb them.
>>>
>>> · memtable_total_space_in_mb: defaults to 1/4 of the heap size; it can
>>> be lowered, because large long-lived objects put pressure on the heap,
>>> so it is better to reduce it somewhat.
>>>
>>> · concurrent_compactors: Alain rightly pointed this out, i.e. reduce it
>>> to 8. You need to try this.
>>>
>>> Also, please check whether you have dropped mutations on the other
>>> nodes or not.
>>>
>>> Hope this helps in your cluster too.
>>>
>>> Regards
>>> Amit Singh
>>>
>>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
>>> *Sent:* Wednesday, March 02, 2016 9:33 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Lot of GC on two nodes out of 7
>>>
>>> Can you post a gist of the output of jstat -gccause (60 seconds worth)?
>>> I think it's cool that you're willing to experiment with alternative JVM
>>> settings, but I've never seen anyone use a max tenuring threshold of 50
>>> either, and I can't imagine it's helpful. Keep in mind that if your
>>> objects are actually reaching that threshold, it means they've been
>>> copied 50 times (really, really slow), and you're also going to end up
>>> spilling your eden objects directly into your old gen if your survivor
>>> space is full. Considering the small amount of memory you're using for
>>> heap, I'm really not surprised you're running into problems.
>>>
>>> I recommend G1GC + a 12GB heap and just letting it optimize itself for
>>> almost all cases with the latest JVM versions.
>>>
>>> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>> It looks like you are doing good work with this cluster and know a lot
>>> about the JVM, that's good :-).
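To make Jonathan's G1 suggestion concrete, the change would live in cassandra-env.sh and look roughly like the lines below -- a sketch only, not something tested on this cluster; the 12G heap is his figure and the pause target is an arbitrary starting point:

# Sketch: switch to G1 in cassandra-env.sh. Remove -XX:+UseParNewGC,
# -XX:+UseConcMarkSweepGC and the other CMS-specific flags first, and do
# not set HEAP_NEWSIZE / -Xmn -- G1 sizes the young generation itself.
MAX_HEAP_SIZE="12G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"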
>>> our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>>>
>>> That's good hardware too.
>>>
>>> With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try
>>> directly, on one of the two bad nodes.
>>>
>>> I would also try lowering the new gen to `HEAP_NEWSIZE=2G` and using
>>> `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the
>>> effects. That's just an idea of something I would try to see the impact;
>>> I don't think it will solve your current issues, nor should it make
>>> things worse for this node.
>>>
>>> Using G1GC would allow you to use a bigger heap size. Using C* 2.1 would
>>> allow you to store the memtables off-heap. Those are two improvements
>>> that reduce heap pressure which you might be interested in.
>>>
>>> I have spent time reading about all other options before including
>>> them, and a similar configuration on our other prod cluster is showing
>>> good GC graphs via gcviewer.
>>>
>>> So, let's look for another reason.
>>>
>>> there are MUTATION and READ messages dropped in high numbers on the
>>> nodes in question, and on the other 5 nodes it varies between 1-3.
>>>
>>> Is memory, CPU or disk a bottleneck? Is one of those running at its
>>> limits?
>>>
>>> concurrent_compactors: 48
>>>
>>> Reducing this to 8 would free some capacity for transactions (read &
>>> write requests). It is probably worth a try, all the more so when
>>> compaction is not keeping up and compaction throughput is not throttled.
>>>
>>> Just found an issue about that:
>>> https://issues.apache.org/jira/browse/CASSANDRA-7139
>>>
>>> Looks like `concurrent_compactors: 8` is the new default.
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>
>>> Thanks a lot Alain for the details.
>>>
>>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>>> might want to keep this as is, or even reduce it, if you have less than
>>> 16 GB of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give
>>> 15 a try.
>>> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default, 75?
>>> Using the default and then tuning from there to improve things is
>>> generally a good idea.
>>>
>>> We have a lot of reads and writes on the system, so we keep the new size
>>> high to make sure enough is held in memory, including caches / memtables
>>> etc. (number of flush_writers: 4 for us). Similarly, we keep the old
>>> generation small to make sure we spend less time in CMS GC, since most
>>> of our data is transient in memory. We keep the TenuringThreshold high
>>> because we don't want objects going to the old generation; they should
>>> just die in the young generation, given we have configured large
>>> survivor spaces.
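For anyone following the numbers in the next paragraph, this is where they come from -- just the standard HotSpot generation split, using the MAX_HEAP_SIZE=6G / HEAP_NEWSIZE=4G / SurvivorRatio=8 settings quoted later in the thread:

# HEAP_NEWSIZE=4G with -XX:SurvivorRatio=8 splits the young gen 8:1:1
# eden                = 4G * 8/10 = 3.2G
# each survivor space = 4G * 1/10 = ~400 MB  (two survivor spaces)
# old gen             = MAX_HEAP_SIZE - HEAP_NEWSIZE = 6G - 4G = 2G
# CMS trigger point   = 70% of 2G = ~1.4G  (-XX:CMSInitiatingOccupancyFraction=70)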
>>> using an occupancy fraction of 70 since:
>>>
>>> given the heap is 4G
>>> survivor space is: 400 MB -- 2 survivor spaces
>>> 70% of 2G (old generation) = 1.4G
>>>
>>> so once we are just below 1.4G and we have to move the full survivor
>>> space plus some extra during a ParNew GC due to promotion failure,
>>> everything will still fit in the old generation, and CMS will be
>>> triggered.
>>>
>>> I have spent time reading about all other options before including
>>> them, and a similar configuration on our other prod cluster is showing
>>> good GC graphs via gcviewer.
>>>
>>> tpstats on all machines shows the flush writer blocked at 0.3% of total.
>>>
>>> The two nodes in question have stats almost as below -- specifically
>>> there are pending tasks in ReadStage, MutationStage and
>>> RequestResponseStage.
>>>
>>> Pool Name                 Active  Pending   Completed  Blocked  All time blocked
>>> ReadStage                     21       19  2141798645        0                 0
>>> RequestResponseStage           0        1   803242391        0                 0
>>> MutationStage                  0        0   291813703        0                 0
>>> ReadRepairStage                0        0   200544344        0                 0
>>> ReplicateOnWriteStage          0        0           0        0                 0
>>> GossipStage                    0        0      292477        0                 0
>>> CacheCleanupExecutor           0        0           0        0                 0
>>> MigrationStage                 0        0           0        0                 0
>>> MemoryMeter                    0        0        2172        0                 0
>>> FlushWriter                    0        0        2756        0                 6
>>> ValidationExecutor             0        0         101        0                 0
>>> InternalResponseStage          0        0           0        0                 0
>>> AntiEntropyStage               0        0         202        0                 0
>>> MemtablePostFlusher            0        0        4395        0                 0
>>> MiscStage                      0        0           0        0                 0
>>> PendingRangeCalculator         0        0          20        0                 0
>>> CompactionExecutor             4        4       49323        0                 0
>>> commitlog_archiver             0        0           0        0                 0
>>> HintedHandoff                  0        0         116        0                 0
>>>
>>> Message type      Dropped
>>> RANGE_SLICE             0
>>> READ_REPAIR            36
>>> PAGED_RANGE             0
>>> BINARY                  0
>>> READ                11471
>>> MUTATION              898
>>> _TRACE                  0
>>> REQUEST_RESPONSE        0
>>> COUNTER_MUTATION        0
>>>
>>> All the other 5 nodes show no pending numbers.
>>>
>>> our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>>> compaction throughput is 0 MB/s
>>> concurrent_compactors: 48
>>> flush_writers: 4
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to the others is
>>> more relevant than the average size for this check.
>>>
>>> I think this is what you are looking for, please correct me if I am
>>> wrong:
>>> Compacted partition maximum bytes: 1629722
>>> similar value on all 7 nodes.
>>>
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>> there are MUTATION and READ messages dropped in high numbers on the
>>> nodes in question, and on the other 5 nodes it varies between 1-3.
>>>
>>> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>> Hi Anishek,
>>>
>>> Even if it highly depends on your workload, here are my thoughts:
>>>
>>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>>> might want to keep this as is, or even reduce it, if you have less than
>>> 16 GB of native memory. Go with 8 GB if you have a lot of memory.
>>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>>> far. I had luck with values between 4 <--> 16 in the past. I would give
>>> 15 a try.
>>> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default, 75?
>>> Using the default and then tuning from there to improve things is
>>> generally a good idea.
>>>
>>> You also use a bunch of options I don't know about. If you are uncertain
>>> about them, you could try a default conf without the options you added,
>>> with just the changes above applied on top of the default
>>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>>> Or you might find more useful information in a nice reference on this
>>> topic, Al Tobey's blog post about tuning 2.1 -- go to the 'Java Virtual
>>> Machine' part:
>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>>
>>> FWIW, I also saw improvements in the past by upgrading to 2.1, Java 8
>>> and G1GC. G1GC is supposed to be easier to configure too.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two
>>> nodes.
>>>
>>> I think Jeff is trying to spot a wide row messing with your system, so
>>> looking at the max row size on those nodes compared to the others is
>>> more relevant than the average size for this check.
>>>
>>> the only other thing that stands out in cfstats is that the read time
>>> and write time on the nodes with high GC are 5-7 times higher than on
>>> the other 5 nodes, but I think that's expected.
>>>
>>> I would probably look at this the reverse way: I imagine the extra GC is
>>> a consequence of something going wrong on those nodes, as the JVM / GC
>>> are configured the same way cluster-wide. GC / JVM issues are often due
>>> to Cassandra / system / hardware issues inducing extra pressure on the
>>> JVM. I would try to tune the JVM / GC only once the system is healthy. I
>>> have often seen high GC being a consequence rather than the root cause
>>> of an issue.
>>>
>>> To explore this possibility:
>>>
>>> Does this command show some dropped or blocked tasks? That would add
>>> pressure to the heap.
>>> nodetool tpstats
>>>
>>> Do you have errors in the logs? Always good to know when facing an
>>> issue.
>>> grep -i "ERROR" /var/log/cassandra/system.log
>>>
>>> How are compactions tuned (throughput + concurrent compactors)? This
>>> tuning might explain compactions not keeping up or high GC pressure.
>>>
>>> What are your disks / CPU? To help us give you good values to try.
>>>
>>> Is there some iowait? It could point to a bottleneck or bad hardware.
>>> iostat -mx 5 100
>>>
>>> ...
>>>
>>> Hope one of those will point you to an issue, but there are many more
>>> things you could check.
>>>
>>> Let us know how it goes,
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>>
>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>
>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote:
>>>
>>> Hey Jeff,
>>>
>>> one of the nodes with high GC has 1400 SSTables; all other nodes have
>>> about 500-900 SSTables. The other node with high GC has 636 SSTables.
>>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two
>>> nodes.
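A quick way to compare those two figures across the ring is something like the line below (run on each node; just a sketch, with the grep strings taken from the cfstats output quoted in this thread):

# A big skew on the two bad nodes in either number would support the
# wide-partition / compaction-backlog theories discussed above.
nodetool cfstats | grep -E 'SSTable count|Compacted partition maximum bytes'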
>>> the only other thing that stands out in cfstats is that the read time
>>> and write time on the nodes with high GC are 5-7 times higher than on
>>> the other 5 nodes, but I think that's expected.
>>>
>>> thanks
>>> anishek
>>>
>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>
>>> Compaction falling behind will likely cause additional work on reads
>>> (more sstables to merge), but I'd be surprised if it manifested in super
>>> long GC. When you say twice as many sstables, how many is that?
>>>
>>> In cfstats, does anything stand out? Is the max row size on those nodes
>>> larger than on the other nodes?
>>>
>>> What you don't show in your JVM options is the new gen size – if you do
>>> have unusually large partitions on those two nodes (especially likely if
>>> you have rf=2 – if you have rf=3, then there's probably a third
>>> misbehaving node you haven't found yet), then raising the new gen size
>>> can help handle the garbage created by reading large partitions without
>>> having to tolerate the promotion. Estimates for the amount of garbage
>>> vary, but it could be "gigabytes" of garbage on a very wide partition
>>> (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>> progress to help mitigate that type of pain).
>>>
>>> - Jeff
>>>
>>> *From: *Anishek Agarwal
>>> *Reply-To: *"user@cassandra.apache.org"
>>> *Date: *Tuesday, March 1, 2016 at 11:12 PM
>>> *To: *"user@cassandra.apache.org"
>>> *Subject: *Lot of GC on two nodes out of 7
>>>
>>> Hello,
>>>
>>> we have a Cassandra cluster of 7 nodes, all of them with the same JVM GC
>>> configuration. All our writes / reads use the TokenAware policy wrapping
>>> a DCAware policy. All nodes are part of the same datacenter.
>>>
>>> We are seeing that two nodes have high GC collection times. They mostly
>>> seem to spend time in GC, around 300-600 ms per run. This also seems to
>>> result in higher CPU utilisation on these machines. The other 5 nodes
>>> don't have this problem.
>>>
>>> There is no additional repair activity going on in the cluster, and we
>>> are not sure why this is happening.
>>>
>>> We checked cfhistograms on the two CFs we have in the cluster and the
>>> number of reads seems to be almost the same.
>>>
>>> We also used cfstats to see the number of sstables on each node, and one
>>> of the nodes with the above problem has twice the number of sstables as
>>> the other nodes. This still does not explain why two nodes have high GC
>>> overheads.
>>> our GC config is as below:
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>> # earlier value 131072 = 32768 * 4
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>> # new
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>
>>> We are using Cassandra 2.0.17. If anyone has any suggestion as to what
>>> else we can look at to understand why this is happening, please do
>>> reply.
>>>
>>> Thanks
>>> anishek
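Pulling the concrete suggestions from the thread into one place, the canary-node experiment would look roughly like this -- a sketch with the values proposed above, not something verified here:

# cassandra-env.sh on one canary node (change the existing values rather
# than adding duplicate flags):
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"

# cassandra.yaml: concurrent_compactors: 8 (the newer default). Also throttle
# compaction instead of leaving it at 0 (unlimited); adjustable at runtime:
nodetool setcompactionthroughput 16

Watching the same jstat output before and after on that one node should make it obvious whether promotion pressure actually drops.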