Yes, it looks like you have at least one ~100MB partition, which is big
enough to cause issues. When you do lots of writes to a large partition it
is likely to end up getting compacted (as per the log), and compactions
often use a lot of memory / cause a lot of GC when they hit large
partitions. This, on top of the write load, is probably pushing you over
the edge.

There are some improvements in 3.6 that might help (
https://issues.apache.org/jira/browse/CASSANDRA-11206) but the 2.2 to 3.x
upgrade path seems risky at best at the moment. In any event, your best
solution would be to find a way to make your partitions smaller (like
1/10th of the size).
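
For illustration only, the usual way to do that is to add an artificial
sub-bucket to the partition key so each 5-minute window is spread across N
partitions, with reads fanning out over all N. A rough sketch, assuming a
hypothetical table (the names below are not your actual schema):

// Hypothetical table, e.g.:
//   CREATE TABLE blogindex.content (
//       bucket     bigint,   -- start of the 5-minute window, epoch millis
//       sub_bucket int,      -- 0..SUB_BUCKETS-1, derived from the document id
//       doc_id     text,
//       body       blob,
//       PRIMARY KEY ((bucket, sub_bucket), doc_id)
//   );
public final class SubBucketKey {

    static final int  SUB_BUCKETS   = 10;              // ~1/10th of current partition size
    static final long WINDOW_MILLIS = 5 * 60 * 1000L;  // the existing 5-minute window

    /** Start of the 5-minute window this write falls into. */
    static long bucket(long epochMillis) {
        return (epochMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
    }

    /** Deterministic sub-bucket so readers can fan out over all SUB_BUCKETS partitions. */
    static int subBucket(String docId) {
        return Math.floorMod(docId.hashCode(), SUB_BUCKETS);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.printf("bucket=%d sub_bucket=%d%n",
                bucket(now), subBucket("example-doc-id"));
    }
}

Reads for a window then become SUB_BUCKETS queries (or one query with an IN
on sub_bucket), which is a small price compared to compacting 100MB+
partitions.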

Cheers
Ben

On Wed, 3 Aug 2016 at 12:35 Kevin Burton <bur...@spinn3r.com> wrote:

> I have a theory as to what I think is happening here.
>
> There is a correlation between writing massive amounts of content all at
> once and our outages.
>
> Our scheme uses large buckets of content: we write to a bucket/partition
> for 5 minutes, then move to a new one.  This way we can page through
> buckets (sketched below).
>
> I think what's happening is that C* is reading the entire partition into
> memory, then slicing through it... which would explain why it's running
> out of memory.
>
> system.log:WARN  [CompactionExecutor:294] 2016-08-03 02:01:55,659
> BigTableWriter.java:184 - Writing large partition
> blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)
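>
> For reference, a rough sketch of the paging pattern mentioned above, using
> the DataStax Java driver with a bounded fetch size so the client never
> holds a whole bucket in memory at once. The column names and the bucket
> value are illustrative guesses, not our real schema:
>
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.ResultSet;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import com.datastax.driver.core.SimpleStatement;
> import com.datastax.driver.core.Statement;
>
> public final class BucketPager {
>     public static void main(String[] args) throws Exception {
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect("blogindex")) {
>             long bucket = 1470154500099L;    // example 5-minute bucket key (epoch millis)
>             Statement stmt = new SimpleStatement(
>                     "SELECT doc_id FROM content_legacy_2016_08_02 WHERE bucket = ?", bucket)
>                     .setFetchSize(500);      // pull 500 rows per page from the coordinator
>             ResultSet rs = session.execute(stmt);
>             for (Row row : rs) {             // iteration fetches further pages lazily
>                 System.out.println(row.getString("doc_id"));
>             }
>         }
>     }
> }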
>
> On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> We have a 60-node C* cluster running 2.2.7 with about 20GB of heap
>> allocated to each C* node.  We're aware of the recommended 8GB limit to
>> keep GC pauses low, but our memory use has been creeping up, probably
>> related to this bug.
>>
>> Here's what we're seeing: if we do a low level of writes, we think
>> everything generally looks good.
>>
>> What happens is that we then need to catch up and do a TON of writes all
>> in a small time window.  Then C* nodes start dropping like flies.  Some of
>> them just GC frequently and are able to recover.  When they GC like this
>> we see GC pauses in the 30-second range, which cause them to stop
>> gossiping for a while, and they drop out of the cluster.
>>
>> This happens as a flurry around the cluster, so we're not always able to
>> catch which ones are doing it as they recover. However, if we have 3 nodes
>> down, we mostly have a locked-up cluster.  Writes don't complete and our
>> app essentially locks up.
>>
>> SOME of the boxes never recover. I'm in this state now.  We have 3-5
>> nodes that are in GC storms from which they won't recover.
>>
>> I reconfigured the GC settings to enable jstat.
>>
>> I was able to catch it while it was happening:
>>
>> ^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT      GCT
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
>>
>> ... as you can see the box is legitimately out of memory: eden (E), the
>> active survivor space (S1) and the old generation (O) are all essentially
>> full.
>>
>> I'm not sure where to go from here.  I think 20GB for our workload is
>> more than reasonable.
>>
>> 90% of the time they're well below 10GB of heap used.  While I was
>> watching this box I was seeing about 30% of the heap used until it decided
>> to climb to 100%.
>>
>> Any advice on what to do next?  I don't see anything obvious in the logs
>> to signal a problem.
>>
>> I attached all the command-line arguments we use.  Note that I think the
>> cassandra-env.sh script puts them in there twice.
>>
>> -ea
>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>> -XX:+CMSClassUnloadingEnabled
>> -XX:+UseThreadPriorities
>> -XX:ThreadPriorityPolicy=42
>> -Xms20000M
>> -Xmx20000M
>> -Xmn4096M
>> -XX:+HeapDumpOnOutOfMemoryError
>> -Xss256k
>> -XX:StringTableSize=1000003
>> -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled
>> -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+UseTLAB
>> -XX:CompileCommandFile=/hotspot_compiler
>> -XX:CMSWaitDuration=10000
>> -XX:+CMSParallelInitialMarkEnabled
>> -XX:+CMSEdenChunksRecordAlways
>> -XX:CMSWaitDuration=10000
>> -XX:+UseCondCardMark
>> -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps
>> -XX:+PrintHeapAtGC
>> -XX:+PrintTenuringDistribution
>> -XX:+PrintGCApplicationStoppedTime
>> -XX:+PrintPromotionFailure
>> -XX:PrintFLSStatistics=1
>> -Xloggc:/var/log/cassandra/gc.log
>> -XX:+UseGCLogFileRotation
>> -XX:NumberOfGCLogFiles=10
>> -XX:GCLogFileSize=10M
>> -Djava.net.preferIPv4Stack=true
>> -Dcom.sun.management.jmxremote.port=7199
>> -Dcom.sun.management.jmxremote.rmi.port=7199
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>> -XX:+UnlockCommercialFeatures
>> -XX:+FlightRecorder
>> -ea
>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>> -XX:+CMSClassUnloadingEnabled
>> -XX:+UseThreadPriorities
>> -XX:ThreadPriorityPolicy=42
>> -Xms20000M
>> -Xmx20000M
>> -Xmn4096M
>> -XX:+HeapDumpOnOutOfMemoryError
>> -Xss256k
>> -XX:StringTableSize=1000003
>> -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled
>> -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+UseTLAB
>> -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler
>> -XX:CMSWaitDuration=10000
>> -XX:+CMSParallelInitialMarkEnabled
>> -XX:+CMSEdenChunksRecordAlways
>> -XX:CMSWaitDuration=10000
>> -XX:+UseCondCardMark
>> -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps
>> -XX:+PrintHeapAtGC
>> -XX:+PrintTenuringDistribution
>> -XX:+PrintGCApplicationStoppedTime
>> -XX:+PrintPromotionFailure
>> -XX:PrintFLSStatistics=1
>> -Xloggc:/var/log/cassandra/gc.log
>> -XX:+UseGCLogFileRotation
>> -XX:NumberOfGCLogFiles=10
>> -XX:GCLogFileSize=10M
>> -Djava.net.preferIPv4Stack=true
>> -Dcom.sun.management.jmxremote.port=7199
>> -Dcom.sun.management.jmxremote.rmi.port=7199
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>> -XX:+UnlockCommercialFeatures
>> -XX:+FlightRecorder
>> -Dlogback.configurationFile=logback.xml
>> -Dcassandra.logdir=/var/log/cassandra
>> -Dcassandra.storagedir=
>> -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
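>>
>> For what it's worth, quick back-of-the-envelope arithmetic implied by
>> those flags (derived from -Xmx20000M, -Xmn4096M, -XX:SurvivorRatio=8 and
>> -XX:CMSInitiatingOccupancyFraction=75, not measured on a node):
>>
>> public final class HeapMath {
>>     public static void main(String[] args) {
>>         long xmxMb = 20000;                    // total heap (-Xmx)
>>         long xmnMb = 4096;                     // young generation (-Xmn)
>>         long oldMb = xmxMb - xmnMb;            // old generation ~15904 MB
>>         long cmsTriggerMb = oldMb * 75 / 100;  // CMS cycle starts around ~11928 MB of old gen
>>         long survivorMb = xmnMb / (8 + 2);     // SurvivorRatio=8 -> each survivor ~409 MB
>>         System.out.printf("old=%d MB, CMS trigger=%d MB, survivor=%d MB%n",
>>                 oldMb, cmsTriggerMb, survivorMb);
>>     }
>> }
>>
>> With -XX:MaxTenuringThreshold=1, anything that survives a single young
>> collection is promoted, so the jstat snapshot above (S1 and E at 100%, old
>> gen at ~95%) would be consistent with promotion failures followed by
>> back-to-back full GCs.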
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>
>
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798
