I am speaking about a thread dump (stack traces for all threads), not a heap dump. The heap dump should contain thread stacks info. Thread dump (stack traces) is small and does not have sensitive info.
Regards, Dmitry On Tue, 5 Nov 2024 at 13:53, Bowen Song via user <user@cassandra.apache.org> wrote: > It's about 18GB in size and may contain a huge amount of sensitive data > (e.g. all the pending writes), so I can't share it. However, if there's any > particular piece of information you would like to have, I'm more than happy > to extract the info from the dump and and share it here. > On 05/11/2024 13:01, Dmitry Konstantinov wrote: > > Hi Bowen, would it be possible to share a full thread dump? > > Regards, > Dmitry > > On Tue, 5 Nov 2024 at 12:12, Bowen Song via user < > user@cassandra.apache.org> wrote: > >> Hi all, >> >> We have a cluster running Cassandra 4.1.1. We are seeing the memtable >> flush randomly getting stuck. This has happened twice in the last 10 days, >> to two different nodes in the same cluster. This started to happen after we >> enabled CDC, and each time it got stuck, there was at least one repair >> running involving the affected node. >> >> The signs of the stuck memtable flush is most obvious from the >> "StatusLogger" logs. >> >> At the beginning, the MemtablePostFlush and MemtableFlushWriter got >> stuck, they have 1 and 2 active tasks each, and a small number of pending >> tasks. >> >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:65 - >> Pool Name Active Pending Completed Blocked >> All Time Blocked >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> ReadStage 0 0 34052333 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> CompactionExecutor 0 0 1019777 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> MutationStage 0 0 14930764 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> MemtableReclaimMemory 0 0 21877 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> PendingRangeCalculator 0 0 177 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> Repair#61 0 0 1344 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> GossipStage 0 0 889452 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> SecondaryIndexManagement 0 0 1 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> HintsDispatcher 0 0 19 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> Repair-Task 0 0 65 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> RequestResponseStage 0 0 44447834 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> Native-Transport-Requests 0 0 8967921 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> MigrationStage 0 0 5 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> MemtableFlushWriter 2 10 21781 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> MemtablePostFlush 1 11 47856 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> PerDiskMemtableFlushWriter_0 0 0 21769 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> Sampler 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> ValidationExecutor 0 0 36651 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> ViewBuildExecutor 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> InternalResponseStage 0 0 240 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> AntiEntropyStage 1 1773 120067 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - >> CacheCleanupExecutor 0 0 0 >> 0 0 >> >> The number of pending tasks slowly grows larger over time, and the number >> of completed tasks do not increase at all. >> >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:65 - >> Pool Name Active Pending Completed Blocked >> All Time Blocked >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> ReadStage 0 0 39905462 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> CompactionExecutor 0 0 1170100 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> MutationStage 0 0 16976992 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#76 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#74 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#72 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> GossipStage 0 0 1026877 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair-Task 0 0 83 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> RequestResponseStage 0 0 51122191 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> ValidationExecutor 0 0 36651 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> MemtablePostFlush 1 54 47856 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> MemtableFlushWriter 2 53 21781 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> CacheCleanupExecutor 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#67 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> MemtableReclaimMemory 0 0 21877 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> PendingRangeCalculator 0 0 177 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#82 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Repair#61 0 0 1344 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> SecondaryIndexManagement 0 0 1 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> HintsDispatcher 0 0 20 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Native-Transport-Requests 0 0 10145020 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> MigrationStage 0 0 5 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> Sampler 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - >> PerDiskMemtableFlushWriter_0 0 0 21769 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - >> ViewBuildExecutor 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - >> InternalResponseStage 0 0 996 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - >> AntiEntropyStage 1 9694 120067 >> 0 0 >> >> Then, things start to get worse, the MutationStage also got stuck. >> >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:65 - >> Pool Name Active Pending Completed Blocked >> All Time Blocked >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> ReadStage 0 0 40186426 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> CompactionExecutor 0 0 1171479 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> MutationStage 128 201414 16988212 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#76 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#74 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#72 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> GossipStage 0 0 1032988 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair-Task 0 0 85 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> RequestResponseStage 0 0 51307497 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> ValidationExecutor 0 0 36651 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> MemtablePostFlush 1 54 47856 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> MemtableFlushWriter 2 53 21781 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> CacheCleanupExecutor 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#67 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> MemtableReclaimMemory 0 0 21877 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#84 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> PendingRangeCalculator 0 0 177 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#82 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Repair#61 0 0 1344 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> SecondaryIndexManagement 0 0 1 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> HintsDispatcher 0 0 20 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Native-Transport-Requests 0 0 10194549 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> MigrationStage 0 0 5 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> Sampler 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> PerDiskMemtableFlushWriter_0 0 0 21769 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> ViewBuildExecutor 0 0 0 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> InternalResponseStage 0 0 1087 >> 0 0 >> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - >> AntiEntropyStage 1 10364 120067 >> 0 0 >> >> Writes on this node starts to timeout and fail. But if left untouched, >> it's only gonna get worse, and eventually lead to JVM OOM and crash. >> >> By inspecting the heap dump created at OOM, we can see that both of the >> MemtableFlushWriter threads are stuck on line 1190 >> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190> >> in the ColumnFamilyStore.java: >> >> // mark writes older than the barrier as blocking progress, >> permitting them to exceed our memory limit >> // if they are stuck waiting on it, then wait for them all to >> complete >> writeBarrier.markBlocking(); >> writeBarrier.await(); // <----------- stuck here >> >> And the MemtablePostFlush thread is stuck on line 1094 >> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094> >> in the same file. >> >> try >> { >> // we wait on the latch for the commitLogUpperBound to be >> set, and so that waiters >> // on this task can rely on all prior flushes being >> complete >> latch.await(); // <----------- stuck here >> } >> Our top suspect is CDC interacting with repair, since this started to >> happen shortly after we enabled CDC on the nodes, and each time repair was >> running. But we have not been able to reproduce this in a testing cluster, >> and don't know what's the next step to troubleshoot this issue. So I'm >> posting it in the mailing lists and hoping someone may know something about >> it or point me to the right direction. >> >> p.s.: sorry about posting this to both the user & dev mailing lists. It's >> an end-user related issue but involves Cassandra internals, so I can't >> decide which one is best suited. >> >> Cheers, >> Bowen >> >> >> > > -- > Dmitry Konstantinov > > -- Dmitry Konstantinov