Hello Asad,

> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.
>
> One Node went OOM and crashed.


If I remember correctly, the first minor versions of C* 3.11 had memory
leaks. This seems to have been fixed in your version though.

> 3.11.1

[...]

 * BTree.Builder memory leak (CASSANDRA-13754)


Yet other improvements were made later on:


> 3.11.3

[...]

 * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)

 * Reduce nodetool GC thread count (CASSANDRA-14475)


See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.
Before digging further I would upgrade to the latest 3.11.x (3.11.4 or
3.11.5, I guess), because the early releases of a major Cassandra version
are known for being quite broken, even though this branch is 'bug fix only'.
Also, minor version upgrades are not too risky to go through. I would maybe
start there if you're not too sure how to dig into this.

If it happens again or you don't want to upgrade, it would be interesting
to know:
- Whether the OOM happens inside the JVM or on native memory (in which case
the OS would be the one sending the kill signal). These 2 issues have
different (and sometimes opposite) fixes.
- What the host size is (especially memory) and how the heap (and maybe some
off-heap structures) is configured (at least whatever is not default).
- Whether you saw errors in the logs, and what 'nodetool tpstats' looked
like when the node went down (it might have been dumped in the logs).
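
As a rough way to tell the first two cases apart, here is a minimal Python
sketch. It assumes that a JVM-side OOM shows up as
'java.lang.OutOfMemoryError' in the Cassandra logs, and that a kill by the
Linux OOM killer shows up in the kernel log as 'invoked oom-killer' or
'Out of memory: Kill(ed) process'. The function name and the patterns are
illustrative only, so adjust them to what your logs actually contain.

```python
import re

# Assumed message fragments (not taken from this thread): the JVM reports
# exhaustion as java.lang.OutOfMemoryError, while the Linux kernel OOM killer
# logs "invoked oom-killer" and "Out of memory: Kill(ed) process".
JVM_OOM = re.compile(r"java\.lang\.OutOfMemoryError")
KERNEL_OOM = re.compile(r"invoked oom-killer|Out of memory: Kill(ed)? process")


def classify_oom(lines):
    """Return 'jvm', 'kernel', 'both' or 'none' for an iterable of log lines."""
    seen_jvm = False
    seen_kernel = False
    for line in lines:
        if JVM_OOM.search(line):
            seen_jvm = True
        if KERNEL_OOM.search(line):
            seen_kernel = True
    if seen_jvm and seen_kernel:
        return "both"
    if seen_jvm:
        return "jvm"
    if seen_kernel:
        return "kernel"
    return "none"
```

You would feed it the lines of the Cassandra system.log together with the
kernel log (for example the output of 'dmesg'); where those live depends on
your install. A 'kernel' result points at native memory pressure on the
host, a 'jvm' result at the JVM itself.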

I don't know much about those traces, nor why Cassandra would take a long
time to start. They are TRACE-level messages, which are harder for me to
interpret. What do the INFO / WARN / ERROR messages look like?
Maybe the node is opening a lot of SSTables and/or replaying a lot of
commit logs, given the nature of the restart (post outage)?
To speed things up under normal circumstances, when nodes are not
crashing, run 'nodetool drain' as part of stopping the node, before
stopping/killing the service/process.
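
To pull the INFO / WARN / ERROR lines out of a log dominated by TRACE
output, something like the following Python sketch could help. It assumes
each entry starts with the level followed by the thread name in brackets,
as in the excerpt you quoted; the function names are made up for this
example.

```python
import re
from collections import Counter

# Matches the start of entries shaped like the quoted excerpt, e.g.
# "TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting"
LOG_LINE = re.compile(r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[([^\]]+)\]")


def summarize_levels(lines):
    """Count log entries per level; lines without a level prefix are ignored."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def above_debug(lines):
    """Keep only the INFO, WARN and ERROR entries, in their original order."""
    keep = {"INFO", "WARN", "ERROR"}
    result = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) in keep:
            result.append(line)
    return result
```

Running 'above_debug' over the debug.log from the restart should leave a
much shorter list to read, and 'summarize_levels' gives a quick idea of how
much of the file is TRACE noise.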

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 18 juin 2019 à 23:43, ZAIDI, ASAD A <az1...@att.com> a écrit :

>
>
> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.
>
>
>
> One Node went OOM and crashed. Re-starting this crashed node is taking
> long time. Trace level debug log is showing messages like:
>
>
>
>
>
> Debug.log trace excerpt:
>
> ========================
>
>
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
>
> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
>
> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log
> file replica
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>
>
>
>
>
> Above messages are repeated for unique [mc-nnnn-* ] files. Such messages
> are repeating constantly.
>
>
>
> I’m seeking help here to find out what may be going on here , any hint to
> root cause and how I can quickly start the node. Thanks in advance.
>
>
>
> Regards/asad
>
>
>
>
>
>
>
