Also, about your traces, according to Jeff in another thread:

> the incomplete sstable will be deleted during startup (in 3.0 and newer
> there's a transaction log of each compaction in progress - that gets
> cleaned during the startup process)

Maybe that's what you are seeing?
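If you want to check whether that cleanup is what the node is busy with, something like this should list the leftover compaction transaction logs (just a sketch; /cassandra/data is the base path from the trace excerpt below, adjust it to your layout):

    # Leftover compaction transaction logs; startup replays and cleans these
    find /cassandra/data -name '*_txn_*.log'

    # Watch the count shrink as the startup cleanup progresses
    watch -n 10 "find /cassandra/data -name '*_txn_*.log' | wc -l"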
Again, I'm not really familiar with those traces. I find traces and debug
pretty useless (or even counter-productive) in 99% of the cases, so I don't
use them much.

On Thu, Jun 20, 2019 at 12:25, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Asad,
>
>> I'm on an environment with Apache Cassandra 3.11.1 with Java 1.8.0_144.
>> One node went OOM and crashed.
>
> If I remember well, the first minor versions of C* 3.11 had memory leaks.
> It seems it was fixed in your version, though:
>
>> 3.11.1
>> [...]
>> * BTree.Builder memory leak (CASSANDRA-13754)
>
> Yet other improvements were made later on:
>
>> 3.11.3
>> [...]
>> * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)
>> * Reduce nodetool GC thread count (CASSANDRA-14475)
>
> See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.
> Before digging more, I would upgrade to 3.11.latest (latest = 4 or 5, I
> guess), because early versions of a major Cassandra version are famous
> for being quite broken, even though this major is a 'bug fix only'
> branch. Also, minor version upgrades are not too risky to go through. I
> would maybe start there if you're not too sure how to dig into this.
>
> If it happens again, or you don't want to upgrade, it would be
> interesting to know:
> - whether the OOM happens inside the JVM or on native memory (in which
> case the OS would be the one sending the kill signal). These 2 issues
> have different (and sometimes opposite) fixes.
> - What the host size is (especially memory) and how the heap (and maybe
> some off-heap structures) are configured (at least what is not default).
> - Whether you saw errors in the logs, and what 'nodetool tpstats' looked
> like when the node went down (it might have been dumped in the logs).
>
> I don't know much about those traces, nor why Cassandra would take such a
> long time; they are traces and thus harder for me to interpret. What do
> the INFO / WARN / ERROR messages look like?
> Maybe it is opening a lot of SSTables and/or replaying a lot of commit
> logs, given the nature of the restart (post outage)?
> To speed things up when nodes are not crashing, i.e. under normal
> circumstances, use 'nodetool drain' as part of stopping the node, before
> stopping/killing the service/process.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
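By the way, on Alain's first question above (OOM inside the JVM vs. on native memory), a quick way to tell the two apart, assuming a Linux box and the usual package log location (/var/log/cassandra is an assumption, adjust to your install):

    # Kernel OOM-killer entries -> native memory pressure, the OS killed the process
    dmesg -T | grep -i -E 'out of memory|killed process'

    # java.lang.OutOfMemoryError in Cassandra's own logs -> JVM/heap problem
    grep -i 'OutOfMemoryError' /var/log/cassandra/system.log*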
> On Tue, Jun 18, 2019 at 23:43, ZAIDI, ASAD A <az1...@att.com> wrote:
>
>> I'm on an environment with Apache Cassandra 3.11.1 with Java 1.8.0_144.
>>
>> One node went OOM and crashed. Restarting this crashed node is taking a
>> long time. The trace-level debug log is showing messages like:
>>
>> Debug.log trace excerpt:
>> ========================
>>
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
>> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
>> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log file replica /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>>
>> The above messages are repeated for unique [mc-nnnn-*] files. Such
>> messages keep repeating constantly.
>>
>> I'm seeking help here to find out what may be going on, any hint at the
>> root cause, and how I can quickly start the node. Thanks in advance.
>>
>> Regards/asad
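PS: for future restarts under normal circumstances, the clean stop Alain mentions would look something like this (the service name is an assumption, it depends on how Cassandra was installed):

    # Flush memtables and stop accepting traffic, then stop the service
    nodetool drain && sudo systemctl stop cassandra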