Hi Dan,

You're welcome, but I must admit you solved it on your own: I was about to advise you to reduce all the JVM settings, the exact opposite of the working solution you found :-). 48 GB is a lot of heap; I would have suggested something like a 26 GB heap and memtables around 4 GB, to try to reduce GC pause times and leave more free memory for page caching. Truth is, without access to the cluster, the best we can do is guess. The operator is the only one with all the needed information ;-).
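For what it's worth, the direction I had in mind looked roughly like this -- purely illustrative values derived from the numbers above, not something tested against your workload:

# cassandra-env.sh -- smaller heap for shorter G1 pauses, leaving ~38 GB of the 64 GB for the OS page cache
MAX_HEAP_SIZE="26G"

# cassandra.yaml -- cap memtables well below the default 1/4 of the heap
memtable_heap_space_in_mb: 4096

Clearly the opposite direction is what actually worked for you, so treat this only as the hypothesis I would have started from.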
If things are running smoothly and efficiently enough, don't try anything else; just stick with the working config, imho. Glad you figured it out while I was out, sorry I missed the follow-up. One extra note on memtable_cleanup_threshold at the very bottom of this mail, below the quoted thread.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-08 20:11 GMT+01:00 Dan Kinder <dkin...@turnitin.com>:

> Quick follow-up here: so far I've had these nodes stable for about 2 days now with the following (still mysterious) solution: *increase* memtable_heap_space_in_mb to 20GB. This was having issues at the default value of 1/4 heap (12GB in my case, I misspoke earlier and said 16GB). Upping it to 20GB seems to have made the issue go away so far.
>
> Best guess now is that it simply was memtable flush throughput. Playing with memtable_cleanup_threshold further may have also helped, but I didn't want to create small SSTables.
>
> Thanks again for the input @Alain.
>
> On Fri, Mar 4, 2016 at 4:53 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Hi, thanks for responding Alain. Going to provide more info inline.
>>
>> However, a small update that is probably relevant: while the node was in this state (MemtableReclaimMemory building up), since this cluster is not serving live traffic I temporarily turned off ALL client traffic, and the node still never recovered; MemtableReclaimMemory never went down. Seems like there is one thread doing this reclaiming and it has gotten stuck somehow.
>>
>> Will let you know when I have more results from experimenting... but again, merci.
>>
>> On Thu, Mar 3, 2016 at 2:32 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>
>>> Hi Dan,
>>>
>>> I'll try to go through all the elements:
>>>
>>>> seeing this odd behavior happen, seemingly to single nodes at a time
>>>
>>> Is that one node at a time or always the same node? Do you consider your data model fairly, evenly distributed?
>>
>> Of 6 nodes, 2 of them seem to be the recurring culprits. Could be related to a particular data partition.
>>
>>>> The node starts to take more and more memory (instance has 48GB memory on G1GC)
>>>
>>> Do you use a 48 GB heap size, or is that the total amount of memory in the node? Could we have your JVM settings (GC and heap sizes), also memtable size and type (off heap?) and the amount of available memory?
>>
>> Machine spec: 24 virtual cores, 64GB memory, 12 HDD JBOD (yes an absurd number of disks, not my choice)
>>
>> memtable_heap_space_in_mb: 10240 # 10GB (previously left as default which was 16GB and caused the issue more frequently)
>> memtable_allocation_type: heap_buffers
>> memtable_flush_writers: 12
>>
>> MAX_HEAP_SIZE="48G"
>> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
>> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
>> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
>> JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>>
>>>> Note that there is a decent number of compactions going on as well but that is expected on these nodes and this particular one is catching up from a high volume of writes
>>>
>>> Are *concurrent_compactors* correctly throttled (about 8 on good machines) and is *compaction_throughput_mb_per_sec* high enough to cope with what is thrown at the node?
>>> Using SSD, I often see the latter unthrottled (using the 0 value), but I would try small increments first.
>>
>> concurrent_compactors: 12
>> compaction_throughput_mb_per_sec: 0
>>
>>>> Also interestingly, neither CPU nor disk utilization are pegged while this is going on
>>>
>>> First thing is making sure your memory management is fine. Having information about the JVM and memory usage globally would help. Then, if you are not fully using the resources, you might want to try increasing *concurrent_writes* to a higher value (probably way higher, given the pending requests, but go safely and incrementally, first on a canary node) and monitor tpstats + resources. Hopefully this will help the pending MutationStage count go down. My guess is that the pending requests are messing with the JVM, but it could be the exact contrary as well.
>>
>> concurrent_writes: 192
>>
>> It may be worth noting that the main reads going on are large batch reads, while these writes are happening (akin to analytics jobs).
>>
>> I'm going to look into JVM use a bit more, but otherwise it seems like normal young generation GCs are happening even as this problem surfaces.
>>
>>>> Native-Transport-Requests 25 0 547935519 0 2586907
>>>
>>> About native transport requests being blocked, you can probably mitigate things by increasing native_transport_max_threads: 128 (try doubling it and continue tuning incrementally). Also, an up to date client using native protocol V3 handles connections / threads from clients a lot better; with a heavy throughput like yours, you might want to give this a try.
>>
>> This one is a good idea and I'll probably try increasing it, but I don't really see these backing up much.
>>
>>> What is your current client?
>>> What does "netstat -an | grep -e 9042 -e 9160 | grep ESTABLISHED | wc -l" output? This is the number of clients connected to the node.
>>> Do you have other significant errors or warnings in the logs (other than dropped mutations)? "grep -i -e "ERROR" -e "WARN" /var/log/cassandra/system.log"
>>
>> 435 incoming connections; the only warning is compaction of some large partitions.
>>
>>> As a small conclusion, I would keep an eye on things related to memory management and also try to push Cassandra's limits by increasing default values, since you seem to have resources available, to make sure Cassandra can cope with the high throughput. Pending operations = high memory pressure. Reducing the pending work somehow will probably get you out of trouble.
>>>
>>> Hope this first round of ideas helps.
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2016-03-02 22:58 GMT+01:00 Dan Kinder <dkin...@turnitin.com>:
>>>
>>>> Also should note: Cassandra 2.2.5, CentOS 6.7
>>>>
>>>> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>>>>
>>>>> Hi y'all,
>>>>>
>>>>> I am writing to a cluster fairly fast and seeing this odd behavior happen, seemingly to single nodes at a time. The node starts to take more and more memory (instance has 48GB memory on G1GC). tpstats shows that MemtableReclaimMemory Pending starts to grow first, then later MutationStage builds up as well.
>>>>> By then most of the memory is being consumed, GC is getting longer, the node slows down and everything slows down unless I kill the node. Also, the number of Active MemtableReclaimMemory threads seems to stay at 1. Also interestingly, neither CPU nor disk utilization are pegged while this is going on; it's on JBOD and there is plenty of headroom there. (Note that there is a decent number of compactions going on as well, but that is expected on these nodes and this particular one is catching up from a high volume of writes.)
>>>>>
>>>>> Anyone have any theories on why this would be happening?
>>>>>
>>>>> $ nodetool tpstats
>>>>> Pool Name                    Active   Pending    Completed   Blocked   All time blocked
>>>>> MutationStage                   192    715481    311327142         0                  0
>>>>> ReadStage                         7         0      9142871         0                  0
>>>>> RequestResponseStage              1         0    690823199         0                  0
>>>>> ReadRepairStage                   0         0      2145627         0                  0
>>>>> CounterMutationStage              0         0            0         0                  0
>>>>> HintedHandoff                     0         0          144         0                  0
>>>>> MiscStage                         0         0            0         0                  0
>>>>> CompactionExecutor               12        24        41022         0                  0
>>>>> MemtableReclaimMemory             1       102         4263         0                  0
>>>>> PendingRangeCalculator            0         0           10         0                  0
>>>>> GossipStage                       0         0       148329         0                  0
>>>>> MigrationStage                    0         0            0         0                  0
>>>>> MemtablePostFlush                 0         0         5233         0                  0
>>>>> ValidationExecutor                0         0            0         0                  0
>>>>> Sampler                           0         0            0         0                  0
>>>>> MemtableFlushWriter               0         0         4270         0                  0
>>>>> InternalResponseStage             0         0     16322698         0                  0
>>>>> AntiEntropyStage                  0         0            0         0                  0
>>>>> CacheCleanupExecutor              0         0            0         0                  0
>>>>> Native-Transport-Requests        25         0    547935519         0            2586907
>>>>>
>>>>> Message type       Dropped
>>>>> READ                     0
>>>>> RANGE_SLICE              0
>>>>> _TRACE                   0
>>>>> MUTATION            287057
>>>>> COUNTER_MUTATION         0
>>>>> REQUEST_RESPONSE         0
>>>>> PAGED_RANGE              0
>>>>> READ_REPAIR            149
>>>>
>>>> --
>>>> Dan Kinder
>>>> Principal Software Engineer
>>>> Turnitin – www.turnitin.com
>>>> dkin...@turnitin.com
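PS - the memtable_cleanup_threshold note I mentioned above, as a rough back-of-the-envelope sketch rather than anything verified on your cluster: if I remember correctly, unless you set it explicitly, memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1), and Cassandra flushes the largest memtable once live memtables use more than memtable_cleanup_threshold * memtable_heap_space_in_mb. With the memtable_flush_writers: 12 from your mail, that would give roughly:

memtable_cleanup_threshold ~= 1 / (12 + 1) ~= 0.077
12 GB space (old default) -> flush at ~0.077 * 12288 MB ~= 945 MB of memtables
20 GB space (your fix)    -> flush at ~0.077 * 20480 MB ~= 1576 MB of memtables

If those assumptions hold, raising memtable_heap_space_in_mb gives you fewer, larger flushes, which fits your guess that flush throughput was the bottleneck, and it does so without lowering memtable_cleanup_threshold and producing the small SSTables you wanted to avoid.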