Hello Asad,
> I’m on environment with apache Cassandra 3.11.1 with java 1.8.0_144. One Node went OOM and crashed.

If I remember correctly, the first minor versions of C* 3.11 had memory leaks. It seems this one was fixed in your version though:

> 3.11.1
> [...]
>  * BTree.Builder memory leak (CASSANDRA-13754)

Yet other improvements were made later on:

> 3.11.3
> [...]
>  * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)
>  * Reduce nodetool GC thread count (CASSANDRA-14475)

See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.

Before digging further I would upgrade to 3.11.latest (latest = 4 or 5 I guess), because the early versions of a major Cassandra release are known for being quite broken, even though this major is a 'bug fix only' branch. Also, minor version upgrades are not too risky to go through. I would maybe start there if you're not too sure how to dig into this.

If it happens again or you don't want to upgrade, it would be interesting to know:

- Whether the OOM happens inside the JVM or on native memory (in which case the OS would be the one sending the kill signal). These 2 issues have different (and sometimes opposite) fixes.
- What the host size is (especially memory) and how the heap (and maybe some off-heap structures) are configured (at least what is not default).
- Whether you saw errors in the logs, and what 'nodetool tpstats' looked like when the node went down (it might have been dumped in the logs).

I don't know much about those traces nor why Cassandra would take a long time to start. They are TRACE-level messages though, and those are harder for me to interpret. What do the INFO / WARN / ERROR messages look like? Maybe the node is opening a lot of SSTables and/or replaying a lot of commit logs, given the nature of the restart (post outage)?

To speed things up under normal circumstances, when nodes are not crashing, run 'nodetool drain' as part of stopping the node, before stopping/killing the service/process.
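
For the first question above (JVM OOM vs. the OS killing the process), here is a rough sketch of how one could check the logs. It is only an illustration: the marker strings are what the JVM and the Linux kernel typically log, the exact wording varies between versions, and the file paths in the usage comment are assumptions about your setup.

```python
import re
import sys

# Assumed markers, wording may differ on your JVM / kernel version:
# - the JVM reports a heap or native allocation failure as java.lang.OutOfMemoryError
# - the Linux OOM killer logs "invoked oom-killer" and "Out of memory: Kill(ed) process"
JVM_OOM = re.compile(r"java\.lang\.OutOfMemoryError")
KERNEL_OOM = re.compile(r"oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE)


def classify_oom(cassandra_log_lines, kernel_log_lines):
    """Return 'jvm', 'os', 'both' or 'unknown' depending on which markers appear."""
    jvm = any(JVM_OOM.search(line) for line in cassandra_log_lines)
    os_kill = any(KERNEL_OOM.search(line) for line in kernel_log_lines)
    if jvm and os_kill:
        return "both"
    if jvm:
        return "jvm"
    if os_kill:
        return "os"
    return "unknown"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage (paths are assumptions):
    #   python classify_oom.py /var/log/cassandra/system.log /var/log/kern.log
    with open(sys.argv[1], errors="replace") as cassandra_log, \
         open(sys.argv[2], errors="replace") as kernel_log:
        print(classify_oom(cassandra_log, kernel_log))
```

A 'jvm' result points at the heap (or allocations made from inside the JVM), an 'os' result points at total process memory versus what the host has available. 'unknown' only means none of these markers were found, not that there was no OOM.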

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Tue, 18 Jun 2019 at 23:43, ZAIDI, ASAD A <az1...@att.com> wrote:

> I’m on environment with apache Cassandra 3.11.1 with java 1.8.0_144.
>
> One Node went OOM and crashed. Re-starting this crashed node is taking long time. Trace level debug log is showing messages like:
>
> Debug.log trace excerpt:
> ========================
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log file replica /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>
> Above messages are repeated for unique [mc-nnnn-* ] files. Such messages are repeating constantly.
>
> I’m seeking help here to find out what may be going on here , any hint to root cause and how I can quickly start the node. Thanks in advance.
>
> Regards/asad