Quick follow-up here: so far I've had these nodes stable for about 2 days
now with the following (still mysterious) solution: *increase*
memtable_heap_space_in_mb to 20GB. The nodes were having issues at the
default value of 1/4 of the heap (12GB in my case; I misspoke earlier and
said 16GB). Upping it to 20GB seems to have made the issue go away so far.

My best guess now is that it was simply memtable flush throughput. Playing
with memtable_cleanup_threshold further might also have helped, but I didn't
want to end up creating small SSTables.
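
For reference, here is roughly what the relevant cassandra.yaml settings look
like on these nodes now (the note about the cleanup threshold default is my
reading of the docs, so take it with a grain of salt):

    memtable_heap_space_in_mb: 20480       # up from the implicit 1/4-heap default (12GB here)
    memtable_allocation_type: heap_buffers
    memtable_flush_writers: 12
    # memtable_cleanup_threshold left unset; I believe the default works out to
    # 1 / (memtable_flush_writers + 1), and lowering it would flush earlier at
    # the cost of smaller SSTables.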

Thanks again for the input @Alain.

On Fri, Mar 4, 2016 at 4:53 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi, thanks for responding Alain. Going to provide more info inline.
>
> However, a small update that is probably relevant: while the node was in
> this state (MemtableReclaimMemory building up), I temporarily turned off ALL
> client traffic (this cluster is not serving live traffic), and the node
> still never recovered; MemtableReclaimMemory pending never went down. It
> seems like there is a single thread doing this reclaiming and it has gotten
> stuck somehow.
>
> Will let you know when I have more results from experimenting... but
> again, merci
>
> On Thu, Mar 3, 2016 at 2:32 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hi Dan,
>>
>> I'll try to go through all the elements:
>>
>> seeing this odd behavior happen, seemingly to single nodes at a time
>>
>>
>> Is that one node at a time or always the same node? Do you consider
>> your data model to be fairly evenly distributed?
>>
>
> Of the 6 nodes, 2 seem to be the recurring culprits. It could be related
> to a particular data partition.
>
>
>>
>> The node starts to take more and more memory (instance has 48GB memory on G1GC)
>>
>>
>> Do you use a 48 GB heap size or is that the total amount of memory in the
>> node? Could we have your JVM settings (GC and heap sizes), as well as the
>> memtable size and type (off heap?) and the amount of available memory?
>>
>
> Machine spec: 24 virtual cores, 64GB memory, 12 HDD JBOD (yes an absurd
> number of disks, not my choice)
>
> memtable_heap_space_in_mb: 10240 # 10GB (previously left as default which
> was 16GB and caused the issue more frequently)
> memtable_allocation_type: heap_buffers
> memtable_flush_writers: 12
>
> MAX_HEAP_SIZE="48G"
> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
> JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>
>>
>> Note that there is a decent number of compactions going on as well but
>> that is expected on these nodes and this particular one is catching up from
>> a high volume of writes
>>
>>
>> Are the *concurrent_compactors* correctly throttled (about 8 with good
>> machines), and is *compaction_throughput_mb_per_sec* high enough to cope
>> with what is thrown at the node? With SSDs I often see the latter
>> unthrottled (a value of 0), but I would try small increments first.
>>
> concurrent_compactors: 12
> compaction_throughput_mb_per_sec: 0
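>
> (If we do want to go back to throttling compaction, my understanding is that
> it can also be adjusted at runtime rather than only in cassandra.yaml, along
> the lines of:
>
>     nodetool setcompactionthroughput 16    # value in MB/s; 0 means unthrottled
>
> and then watching nodetool compactionstats to see whether compactions keep up.
> The 16 MB/s here is just an illustrative starting point, not a recommendation.)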
>
>>
>> Also interestingly, neither CPU nor disk utilization are pegged while
>> this is going on
>>
>>
>> The first thing is making sure your memory management is fine. Having
>> information about the JVM and memory usage globally would help. Then, if
>> you are not fully using the resources, you might want to try increasing
>> *concurrent_writes* to a higher value (probably way higher, given the
>> pending requests, but go safely and incrementally, first on a canary node)
>> and monitor tpstats + resources. Hopefully this will help get MutationStage
>> pending down. My guess is that the pending requests are messing with the
>> JVM, but it could be the exact opposite as well.
>>
> concurrent_writes: 192
> It may be worth noting that the main reads going on while these writes are
> happening are large batch reads (akin to analytics jobs).
>
> I'm going to look into JVM use a bit more, but otherwise it seems like
> normal young-generation GCs are happening even as this problem surfaces.
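>
> (If I do try raising concurrent_writes on a canary node as you suggest, the
> idea would be a small incremental bump in cassandra.yaml, something like:
>
>     concurrent_writes: 256    # hypothetical next step up from 192, untested
>
> and then watching MutationStage pending in nodetool tpstats, plus CPU and disk,
> before going further.)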
>
>
>>
>> Native-Transport-Requests        25         0      547935519         0           2586907
>>
>>
>> About the native transport requests being blocked, you can probably mitigate
>> this by increasing native_transport_max_threads from its default of 128 (try
>> doubling it and continue tuning incrementally). Also, an up-to-date client
>> using native protocol V3 handles connections / threads from clients a lot
>> better. With a heavy throughput like yours, you might want to give this a try.
>>
>
> This one is a good idea and I'll probably try increasing it, but I don't
> really see these back up, so it's probably not the root cause here.
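>
> (For the record, trying that would mean setting roughly the following in
> cassandra.yaml, one node at a time, per your doubling suggestion:
>
>     native_transport_max_threads: 256    # up from the 128 default
>
> and then checking whether the Native-Transport-Requests blocked count improves.)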
>
>
>>
>> What is your current client?
>> What does "netstat -an | grep -e 9042 -e 9160 | grep ESTABLISHED | wc -l"
>> output? This is the number of clients connected to the node.
>> Do you have other significant errors or warnings in the logs (other than
>> dropped mutations)? "grep -i -e "ERROR" -e "WARN"
>> /var/log/cassandra/system.log"
>>
>
> 435 incoming connections; the only warning is about compaction of some large
> partitions.
>
>
>>
>> As a small conclusion, I would keep an eye on things related to memory
>> management, and also try pushing Cassandra's limits by increasing default
>> values, since you seem to have resources available, to make sure Cassandra
>> can cope with the high throughput. Pending operations = high memory pressure.
>> Reducing the pending work somehow will probably get you out of trouble.
>>
>> Hope this first round of ideas will help you.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> 2016-03-02 22:58 GMT+01:00 Dan Kinder <dkin...@turnitin.com>:
>>
>>> Also should note: Cassandra 2.2.5, CentOS 6.7
>>>
>>> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>>>
>>>> Hi y'all,
>>>>
>>>> I am writing to a cluster fairly fast and seeing this odd behavior
>>>> happen, seemingly to single nodes at a time. The node starts to take more
>>>> and more memory (instance has 48GB memory on G1GC). tpstats shows that
>>>> MemtableReclaimMemory Pending starts to grow first, then later
>>>> MutationStage builds up as well. By then most of the memory is being
>>>> consumed, GC pauses are getting longer, and the node, and everything with it,
>>>> slows down unless I kill the node. Also the number of Active MemtableReclaimMemory
>>>> threads seems to stay at 1. Also interestingly, neither CPU nor disk
>>>> utilization are pegged while this is going on; it's on jbod and there is
>>>> plenty of headroom there. (Note that there is a decent number of
>>>> compactions going on as well but that is expected on these nodes and this
>>>> particular one is catching up from a high volume of writes).
>>>>
>>>> Anyone have any theories on why this would be happening?
>>>>
>>>>
>>>> $ nodetool tpstats
>>>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>>> MutationStage                   192    715481      311327142         0                 0
>>>> ReadStage                         7         0        9142871         0                 0
>>>> RequestResponseStage              1         0      690823199         0                 0
>>>> ReadRepairStage                   0         0        2145627         0                 0
>>>> CounterMutationStage              0         0              0         0                 0
>>>> HintedHandoff                     0         0            144         0                 0
>>>> MiscStage                         0         0              0         0                 0
>>>> CompactionExecutor               12        24          41022         0                 0
>>>> MemtableReclaimMemory             1       102           4263         0                 0
>>>> PendingRangeCalculator            0         0             10         0                 0
>>>> GossipStage                       0         0         148329         0                 0
>>>> MigrationStage                    0         0              0         0                 0
>>>> MemtablePostFlush                 0         0           5233         0                 0
>>>> ValidationExecutor                0         0              0         0                 0
>>>> Sampler                           0         0              0         0                 0
>>>> MemtableFlushWriter               0         0           4270         0                 0
>>>> InternalResponseStage             0         0       16322698         0                 0
>>>> AntiEntropyStage                  0         0              0         0                 0
>>>> CacheCleanupExecutor              0         0              0         0                 0
>>>> Native-Transport-Requests        25         0      547935519         0           2586907
>>>>
>>>> Message type           Dropped
>>>> READ                         0
>>>> RANGE_SLICE                  0
>>>> _TRACE                       0
>>>> MUTATION                287057
>>>> COUNTER_MUTATION             0
>>>> REQUEST_RESPONSE             0
>>>> PAGED_RANGE                  0
>>>> READ_REPAIR                149
>>>>
>>>>
>>>
>>>
>>> --
>>> Dan Kinder
>>> Principal Software Engineer
>>> Turnitin – www.turnitin.com
>>> dkin...@turnitin.com
>>>
>>
>
