Hi Anishek, In addition to the good advice others have given, do you notice any abnormally large partitions? What does cfhistograms report for 99% partition size? A few huge partitions will cause very disproportionate load on your cluster, including high GC.
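
For example, something along these lines (keyspace/table names are placeholders):

nodetool cfhistograms <keyspace> <table>
nodetool cfstats | grep -i 'compacted partition'   # compacted partition size stats per table
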
--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> wrote:

> Hi Anishek,
>
> We too faced a similar problem in 2.0.14, and after doing some research we configured a few parameters in cassandra.yaml and were able to overcome the GC pauses. Those are:
>
> · memtable_flush_writers: increased from 1 to 3, as the tpstats output showed dropped mutations, which means writes were getting blocked; increasing the number of writers helps absorb them.
>
> · memtable_total_space_in_mb: default is 1/4 of the heap size; it can be lowered, because large long-lived objects create pressure on the heap, so it is better to reduce it somewhat.
>
> · concurrent_compactors: Alain rightly pointed this out, i.e. reduce it to 8. You need to try this.
>
> Also please check whether you have mutations dropped on the other nodes or not.
>
> Hope this helps in your cluster too.
>
> Regards
> Amit Singh
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Wednesday, March 02, 2016 9:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Lot of GC on two nodes out of 7
>
> Can you post a gist of the output of jstat -gccause (60 seconds worth)? I think it's cool you're willing to experiment with alternative JVM settings, but I've never seen anyone use a max tenuring threshold of 50 either and I can't imagine it's helpful. Keep in mind that if your objects are actually reaching that threshold it means they've been copied 50x (really, really slow), and you're also going to end up spilling your eden objects directly into your old gen if your survivor is full. Considering the small amount of memory you're using for heap, I'm really not surprised you're running into problems.
>
> I recommend G1GC + 12GB heap and just letting it optimize itself for almost all cases with the latest JVM versions.
>
> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> It looks like you are doing good work with this cluster and know a lot about the JVM, that's good :-).
>
> Our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
>
> That's good hardware too.
>
> With 64 GB of RAM I would probably give `MAX_HEAP_SIZE=8G` a try directly, on one of the 2 bad nodes.
>
> I would also probably try lowering to `HEAP_NEWSIZE=2G` and using `-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the effects. But that's just an idea of something I would try in order to see the impact; I don't think it will solve your current issues, and it might even make things worse for this node.
>
> Using G1GC would allow you to use a bigger heap size. Using C* 2.1 would allow you to store the memtables off-heap. Those are 2 improvements reducing heap pressure that you might be interested in.
>
> I have spent time reading about all the other options before including them, and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.
>
> So, let's look for another reason.
>
> There are MUTATION and READ messages dropped in high numbers on the nodes in question, and on the other 5 nodes it varies between 1-3.
>
> - Is memory, CPU or disk a bottleneck? Is one of those running at the limits?
>
> concurrent_compactors: 48
>
> Reducing this to 8 would free some space for transactions (R&W requests). It is probably worth a try, even more so when compaction is not keeping up and compaction throughput is not throttled.
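>
> For instance, something like this (the throughput value below is just an illustration):
>
> # cassandra.yaml
> concurrent_compactors: 8
> compaction_throughput_mb_per_sec: 16
>
> # throughput can also be changed live, without a restart:
> nodetool setcompactionthroughput 16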
>
> Just found an issue about that: https://issues.apache.org/jira/browse/CASSANDRA-7139
>
> Looks like `concurrent_compactors: 8` is the new default.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>
> Thanks a lot Alain for the details.
>
> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.
>
> We have a lot of reads and writes on the system, so we keep the new size high to make sure enough is held in memory, including caches / memtables etc. (number of flush_writers: 4 for us). Similarly, we keep less in the old generation to make sure we spend less time in CMS GC, since most of the data is transient in memory for us. We keep the tenuring threshold high because we don't want objects going to the old generation; they should just die in the young generation, given we have configured large survivor spaces.
>
> We use an occupancyFraction of 70 since:
> the new gen size is 4G (out of the 6G heap),
> survivor space is 400 MB (2 survivor spaces),
> 70% of 2G (the old generation) = 1.4G,
> so once we are just below 1.4G and we have to move the full survivor + some extra during a ParNew GC due to promotion failure, everything will still fit in the old generation, and CMS will be triggered.
>
> I have spent time reading about all the other options before including them, and a similar configuration on our other prod cluster is showing good GC graphs via gcviewer.
>
> tpstats on all machines shows the flush writer blocked at 0.3% of total.
>
> The two nodes in question have stats almost as below
> - specifically, there are pending tasks in ReadStage, MutationStage and RequestResponseStage
>
> Pool Name                    Active   Pending    Completed   Blocked  All time blocked
> ReadStage                        21        19   2141798645         0                 0
> RequestResponseStage              0         1    803242391         0                 0
> MutationStage                     0         0    291813703         0                 0
> ReadRepairStage                   0         0    200544344         0                 0
> ReplicateOnWriteStage             0         0            0         0                 0
> GossipStage                       0         0       292477         0                 0
> CacheCleanupExecutor              0         0            0         0                 0
> MigrationStage                    0         0            0         0                 0
> MemoryMeter                       0         0         2172         0                 0
> FlushWriter                       0         0         2756         0                 6
> ValidationExecutor                0         0          101         0                 0
> InternalResponseStage             0         0            0         0                 0
> AntiEntropyStage                  0         0          202         0                 0
> MemtablePostFlusher               0         0         4395         0                 0
> MiscStage                         0         0            0         0                 0
> PendingRangeCalculator            0         0           20         0                 0
> CompactionExecutor                4         4        49323         0                 0
> commitlog_archiver                0         0            0         0                 0
> HintedHandoff                     0         0          116         0                 0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                 36
> PAGED_RANGE                  0
> BINARY                       0
> READ                     11471
> MUTATION                   898
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
>
> All the other 5 nodes show no pending numbers.
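>
> For reference, a rough way to keep an eye on these counters over time (the 60-second interval and the grep pattern are arbitrary):
>
> while true; do date; nodetool tpstats | egrep 'ReadStage|MutationStage|RequestResponseStage|READ|MUTATION'; sleep 60; done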
>
> Our machine configurations are: 2 x 800 GB SSD, 48 cores, 64 GB RAM
> compaction throughput is 0 MB/s
> concurrent_compactors: 48
> flush_writers: 4
>
> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.
>
> I think this is what you are looking for, please correct me if I am wrong:
> Compacted partition maximum bytes: 1629722
> Similar value on all 7 nodes.
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
> There are MUTATION and READ messages dropped in high numbers on the nodes in question, and on the other 5 nodes it varies between 1-3.
>
> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> Hi Anishek,
>
> Even if it highly depends on your workload, here are my thoughts:
>
> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G)
>
> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You might want to keep this as is, or even reduce it, if you have less than 16 GB of native memory. Go with 8 GB if you have a lot of memory.)
>
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so far. I had luck with values between 4 <--> 16 in the past. I would give 15 a try.
>
> `-XX:CMSInitiatingOccupancyFraction=70` --> why not use the default of 75? Using the default and then tuning from there to improve things is generally a good idea.
>
> You also use a bunch of options I don't know about. If you are uncertain about them, you could try a default conf without the options you added, applying just the changes above on top of the default https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. Or you might find more useful information in a nice reference on this topic, Al Tobey's blog post about tuning 2.1. Go to the 'Java Virtual Machine' part: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
> FWIW, I also saw improvements in the past by upgrading to 2.1, Java 8 and G1GC. G1GC is supposed to be easier to configure too.
>
> The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
>
> I think Jeff is trying to spot a wide row messing with your system, so looking at the max row size on those nodes compared to the others is more relevant than the average size for this check.
>
> The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.
>
> I would probably look at this the reverse way: I imagine that the extra GC is a consequence of something going wrong on those nodes, as the JVM / GC are configured the same way cluster-wide. GC / JVM issues are often due to Cassandra / system / hardware issues inducing extra pressure on the JVM, and I would try to tune the JVM / GC only once the system is healthy. So I have often seen high GC being a consequence rather than the root cause of an issue.
>
> To explore this possibility:
>
> Does this command show some dropped or blocked tasks? This would add pressure to the heap.
> nodetool tpstats
>
> Do you have errors in the logs? Always good to know when facing an issue.
> grep -i "ERROR" /var/log/cassandra/system.log
>
> How are compactions tuned (throughput + concurrent compactors)? This tuning might explain compactions not keeping up or high GC pressure.
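>
> For example (a quick check; the yaml path below is just the common default and may differ on your install):
>
> nodetool compactionstats    # pending / active compactions
> grep -E 'concurrent_compactors|compaction_throughput_mb_per_sec' /etc/cassandra/cassandra.yaml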
>
> What are your disks / CPU? To help us give you sensible values to try.
>
> Is there some iowait? It could point to a bottleneck or bad hardware.
> iostat -mx 5 100
>
> ...
>
> Hope one of those will point you to an issue, but there are many more things you could check.
>
> Let us know how it goes,
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>
> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>
> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote:
>
> Hey Jeff,
>
> One of the nodes with high GC has 1400 SSTables, all other nodes have about 500-900 SSTables. The other node with high GC has 636 SSTables.
>
> The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3 but the problem is only on two nodes.
>
> The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC are 5-7 times higher than on the other 5 nodes, but I think that's expected.
>
> thanks
> anishek
>
> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>
> Compaction falling behind will likely cause additional work on reads (more sstables to merge), but I'd be surprised if it manifested in super long GC. When you say twice as many sstables, how many is that?
>
> In cfstats, does anything stand out? Is the max row size on those nodes larger than on other nodes?
>
> What you don't show in your JVM options is the new gen size. If you do have unusually large partitions on those two nodes (especially likely if you have rf=2; if you have rf=3, then there's probably a third misbehaving node you haven't found yet), then raising the new gen size can help handle the garbage created by reading large partitions without having to tolerate the promotion. Estimates for the amount of garbage vary, but it could be "gigabytes" of garbage on a very wide partition (see https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to help mitigate that type of pain).
>
> - Jeff
>
> *From:* Anishek Agarwal
> *Reply-To:* "user@cassandra.apache.org"
> *Date:* Tuesday, March 1, 2016 at 11:12 PM
> *To:* "user@cassandra.apache.org"
> *Subject:* Lot of GC on two nodes out of 7
>
> Hello,
>
> We have a Cassandra cluster of 7 nodes, all of them have the same JVM GC configuration, and all our writes / reads use the TokenAware policy wrapping a DCAware policy. All nodes are part of the same datacenter.
>
> We are seeing that two nodes are having high GC collection times. They mostly seem to spend about 300-600 ms in GC. This also seems to result in higher CPU utilisation on these machines. The other 5 nodes don't have this problem.
>
> There is no additional repair activity going on in the cluster, and we are not sure why this is happening.
>
> We checked cfhistograms on the two CFs we have in the cluster and the number of reads seems to be almost the same.
>
> We also used cfstats to see the number of SSTables on each node, and one of the nodes with the above problem has twice as many SSTables as the other nodes. This still does not explain why two nodes have high GC overhead.
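>
> For reference, the GC pauses are also visible as GCInspector lines in the Cassandra log, so a quick way to compare nodes is something like:
> grep 'GCInspector' /var/log/cassandra/system.log | tail -n 20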
>
> Our GC config is as below:
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> # earlier value 131072 = 32768 * 4
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
> #new
> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>
> We are using Cassandra 2.0.17. If anyone has any suggestion as to what else we can look at to understand why this is happening, please do reply.
>
> Thanks
> anishek
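>
> PS: if useful, GC activity on a node can be sampled with something along these lines (the pid lookup and the one-second x 60 sampling are only an illustration):
> jstat -gccause $(pgrep -f CassandraDaemon) 1000 60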