Also, about your traces, according to Jeff in another thread:

> the incomplete sstable will be deleted during startup (in 3.0 and newer
> there's a transaction log of each compaction in progress - that gets
> cleaned during the startup process)

Maybe that's what you are seeing?
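If you want to check whether that cleanup is what the node is busy with, something like this should list the leftover compaction transaction logs (just a sketch; /cassandra/data is the base path from the trace excerpt below, adjust it to your layout):

    # Leftover compaction transaction logs; startup replays and cleans these
    find /cassandra/data -name '*_txn_*.log'

    # Watch the count shrink as the startup cleanup progresses
    watch -n 10 "find /cassandra/data -name '*_txn_*.log' | wc -l"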
Again, I'm not really familiar with those traces. I find traces and debug
pretty useless (or even counter-productive) in 99% of the cases, so I don't
use them much.

On Thu, Jun 20, 2019 at 12:25, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Asad,
>
>> I'm on an environment with Apache Cassandra 3.11.1 with Java 1.8.0_144.
>> One node went OOM and crashed.
>
> If I remember well, the first minor versions of C* 3.11 had memory leaks.
> It seems it was fixed in your version, though:
>
>> 3.11.1
>> [...]
>> * BTree.Builder memory leak (CASSANDRA-13754)
>
> Yet other improvements were made later on:
>
>> 3.11.3
>> [...]
>> * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)
>> * Reduce nodetool GC thread count (CASSANDRA-14475)
>
> See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.
> Before digging more, I would upgrade to 3.11.latest (latest = 4 or 5, I
> guess), because early versions of a major Cassandra version are famous
> for being quite broken, even though this major is a 'bug fix only'
> branch. Also, minor version upgrades are not too risky to go through. I
> would maybe start there if you're not too sure how to dig into this.
>
> If it happens again, or you don't want to upgrade, it would be
> interesting to know:
> - whether the OOM happens inside the JVM or on native memory (in which
> case the OS would be the one sending the kill signal). These 2 issues
> have different (and sometimes opposite) fixes.
> - What the host size is (especially memory) and how the heap (and maybe
> some off-heap structures) are configured (at least what is not default).
> - Whether you saw errors in the logs, and what 'nodetool tpstats' looked
> like when the node went down (it might have been dumped in the logs).
>
> I don't know much about those traces, nor why Cassandra would take such a
> long time; they are traces and thus harder for me to interpret. What do
> the INFO / WARN / ERROR messages look like?
> Maybe it is opening a lot of SSTables and/or replaying a lot of commit
> logs, given the nature of the restart (post outage)?
> To speed things up when nodes are not crashing, i.e. under normal
> circumstances, use 'nodetool drain' as part of stopping the node, before
> stopping/killing the service/process.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
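By the way, on Alain's first question above (OOM inside the JVM vs. on native memory), a quick way to tell the two apart, assuming a Linux box and the usual package log location (/var/log/cassandra is an assumption, adjust to your install):

    # Kernel OOM-killer entries -> native memory pressure, the OS killed the process
    dmesg -T | grep -i -E 'out of memory|killed process'

    # java.lang.OutOfMemoryError in Cassandra's own logs -> JVM/heap problem
    grep -i 'OutOfMemoryError' /var/log/cassandra/system.log*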
> On Tue, Jun 18, 2019 at 23:43, ZAIDI, ASAD A <az1...@att.com> wrote:
>
>> I'm on an environment with Apache Cassandra 3.11.1 with Java 1.8.0_144.
>>
>> One node went OOM and crashed. Restarting this crashed node is taking a
>> long time. The trace-level debug log is showing messages like:
>>
>> Debug.log trace excerpt:
>> ========================
>>
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
>> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
>> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log file replica /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>>
>> The above messages are repeated for unique [mc-nnnn-*] files. Such
>> messages keep repeating constantly.
>>
>> I'm seeking help here to find out what may be going on, any hint at the
>> root cause, and how I can quickly start the node. Thanks in advance.
>>
>> Regards/asad
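PS: for future restarts under normal circumstances, the clean stop Alain mentions would look something like this (the service name is an assumption, it depends on how Cassandra was installed):

    # Flush memtables and stop accepting traffic, then stop the service
    nodetool drain && sudo systemctl stop cassandra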