Yeah, I looked through your stack trace and saw it wasn't the same thing, but the steps to identify the root cause should be the same.
I nuked ZFS from orbit :) This was happening across all the machines in the cluster at various times, and we haven't seen a single issue since switching to XFS. Thanks for the advice though, I'll keep it in mind if I encounter it again.

Jon

On Tue, Nov 5, 2024 at 9:18 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> Hi Jon,
>
> That is interesting. We happen to be running Cassandra on ZFS. However, we have not had any incident for years with this setup; the only change is the recent addition of CDC.
>
> I can see that in CASSANDRA-19564, the MemtablePostFlush thread was stuck on the unlink() syscall. But in our case, it was stuck here:
>
> "MemtablePostFlush:1" daemon prio=5 tid=237 WAITING
>     at jdk.internal.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$RegisteredSignal#1
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
>        local variable: org.apache.cassandra.utils.concurrent.CountDownLatch$Async#7
>     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
>     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1094)
>        local variable: org.apache.cassandra.db.ColumnFamilyStore$PostFlush#7
>     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1077)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>        local variable: org.apache.cassandra.concurrent.SingleThreadExecutorPlus#5
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#11
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>        local variable: io.netty.util.concurrent.FastThreadLocalRunnable#11
>     at java.lang.Thread.run(Thread.java:829)
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#11
>
> This made me believe that they are not the same issue.
>
> Thanks for the info though.
>
> BTW, if you still have access to that oddly behaving ZFS filesystem, have a look at the free space (it should have at least 20% free) and the fragmentation ratio in "zpool list". If neither is an issue, check whether deduplication is enabled or there's a large number of snapshots. These can have a significant impact on file deletion performance on ZFS. It's also worth checking the disks: I have seen broken disks that get stuck on some operations, e.g. when a specific sector is being read, and that will certainly affect the filesystem's behaviour.
>
> Cheers,
> Bowen
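If slow file deletion is the suspect on a setup like this, one quick sanity check is to time unlink() directly on the affected mount. Here is a rough, hypothetical Java probe for that; the default directory and the file size below are just placeholders, adjust them to the real commitlog/CDC directory:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Rough probe for slow unlink() on a suspect mount. The default path is a
    // placeholder - point it at the actual commitlog or CDC directory.
    public class UnlinkLatencyProbe
    {
        public static void main(String[] args) throws IOException
        {
            Path dir = Paths.get(args.length > 0 ? args[0] : "/var/lib/cassandra/commitlog");
            ByteBuffer chunk = ByteBuffer.allocate(1 << 20); // 1 MiB of zeros

            for (int i = 0; i < 10; i++)
            {
                Path file = Files.createTempFile(dir, "unlink-probe-", ".tmp");
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE))
                {
                    for (int j = 0; j < 32; j++) // ~32 MiB, roughly one commitlog segment
                    {
                        chunk.rewind();
                        ch.write(chunk);
                    }
                    ch.force(true); // make sure the data hits the disk before deleting
                }

                long start = System.nanoTime();
                Files.delete(file); // the unlink we want to time
                System.out.printf("delete #%d took %d us%n", i, (System.nanoTime() - start) / 1_000);
            }
        }
    }

Consistently slow or occasionally multi-second deletes here would point at the filesystem or the disks rather than at Cassandra.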
> On 05/11/2024 16:41, Jon Haddad wrote:
>
> I ran into this a few months ago, and in my case I tracked it down to an issue with ZFS not unlinking commitlogs properly.
>
> https://issues.apache.org/jira/browse/CASSANDRA-19564
>
> On Tue, Nov 5, 2024 at 6:05 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>
>> I am speaking about a thread dump (stack traces for all threads), not a heap dump. The heap dump should contain the thread stack info.
>> A thread dump (stack traces) is small and does not have sensitive info.
>>
>> Regards,
>> Dmitry
>>
>> On Tue, 5 Nov 2024 at 13:53, Bowen Song via user <user@cassandra.apache.org> wrote:
>>
>>> It's about 18GB in size and may contain a huge amount of sensitive data (e.g. all the pending writes), so I can't share it. However, if there's any particular piece of information you would like to have, I'm more than happy to extract the info from the dump and share it here.
>>>
>>> On 05/11/2024 13:01, Dmitry Konstantinov wrote:
>>>
>>> Hi Bowen, would it be possible to share a full thread dump?
>>>
>>> Regards,
>>> Dmitry
>>>
>>> On Tue, 5 Nov 2024 at 12:12, Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have a cluster running Cassandra 4.1.1. We are seeing the memtable flush randomly getting stuck. This has happened twice in the last 10 days, to two different nodes in the same cluster. It started to happen after we enabled CDC, and each time it got stuck there was at least one repair running involving the affected node.
>>>>
>>>> The signs of the stuck memtable flush are most obvious in the "StatusLogger" logs.
>>>>
>>>> At the beginning, the MemtablePostFlush and MemtableFlushWriter pools got stuck; they have 1 and 2 active tasks respectively, and a small number of pending tasks.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ReadStage                           0        0   34052333        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CompactionExecutor                  0        0    1019777        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MutationStage                       0        0   14930764        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - GossipStage                         0        0     889452        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - HintsDispatcher                     0        0         19        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair-Task                         0        0         65        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - RequestResponseStage                0        0   44447834        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Native-Transport-Requests           0        0    8967921        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableFlushWriter                 2       10      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtablePostFlush                   1       11      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - InternalResponseStage               0        0        240        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - AntiEntropyStage                    1     1773     120067        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>>
>>>> The number of pending tasks slowly grows larger over time, and the number of completed tasks does not increase at all.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ReadStage                           0        0   39905462        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CompactionExecutor                  0        0    1170100        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MutationStage                       0        0   16976992        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#76                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#74                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#72                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - GossipStage                         0        0    1026877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair-Task                         0        0         83        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - RequestResponseStage                0        0   51122191        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtablePostFlush                   1       54      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableFlushWriter                 2       53      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#67                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#82                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - HintsDispatcher                     0        0         20        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Native-Transport-Requests           0        0   10145020        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - InternalResponseStage               0        0        996        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - AntiEntropyStage                    1     9694     120067        0                 0
>>>>
>>>> Then things start to get worse: the MutationStage also gets stuck.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ReadStage                           0        0   40186426        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CompactionExecutor                  0        0    1171479        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MutationStage                     128   201414   16988212        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#76                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#74                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#72                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - GossipStage                         0        0    1032988        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair-Task                         0        0         85        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - RequestResponseStage                0        0   51307497        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtablePostFlush                   1       54      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableFlushWriter                 2       53      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#67                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#84                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#82                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - HintsDispatcher                     0        0         20        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Native-Transport-Requests           0        0   10194549        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - InternalResponseStage               0        0       1087        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - AntiEntropyStage                    1    10364     120067        0                 0
>>>>
>>>> Writes on this node start to time out and fail. If left untouched, it only gets worse, and eventually leads to a JVM OOM and crash.
>>>>
>>>> By inspecting the heap dump created at OOM, we can see that both of the MemtableFlushWriter threads are stuck on line 1190 <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190> in ColumnFamilyStore.java:
>>>>
>>>>             // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
>>>>             // if they are stuck waiting on it, then wait for them all to complete
>>>>             writeBarrier.markBlocking();
>>>>             writeBarrier.await(); // <----------- stuck here
>>>>
>>>> And the MemtablePostFlush thread is stuck on line 1094 <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094> in the same file:
>>>>
>>>>             try
>>>>             {
>>>>                 // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
>>>>                 // on this task can rely on all prior flushes being complete
>>>>                 latch.await(); // <----------- stuck here
>>>>             }
>>>>
>>>> Our top suspect is CDC interacting with repair, since this started to happen shortly after we enabled CDC on the nodes, and each time it happened a repair was running. But we have not been able to reproduce this in a testing cluster, and we don't know what the next step to troubleshoot this issue should be. So I'm posting it to the mailing lists, hoping someone may know something about it or point me in the right direction.
>>>>
>>>> p.s.: sorry about posting this to both the user & dev mailing lists. It's an end-user related issue, but it involves Cassandra internals, so I couldn't decide which one is better suited.
>>>>
>>>> Cheers,
>>>> Bowen
>>>>
>>>
>>> --
>>> Dmitry Konstantinov
>>>
>>
>> --
>> Dmitry Konstantinov
>>
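To make the failure mode described in the quoted report easier to picture: post-flush tasks run on a single thread, and each one waits on a latch that is only released when its own flush completes, so a single wedged flush leaves every later task pending. Below is a toy, self-contained Java sketch of that pattern; it is an illustration only, not Cassandra's actual code, and the class and variable names are made up:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Toy model of the MemtablePostFlush queue: one worker thread, and each
    // task blocks on a latch that is only released when its flush finishes.
    public class PostFlushStall
    {
        public static void main(String[] args) throws InterruptedException
        {
            ExecutorService postFlush = Executors.newSingleThreadExecutor();

            for (int i = 0; i < 5; i++)
            {
                final int flushId = i;
                final CountDownLatch flushed = new CountDownLatch(1);

                // Pretend the flush writer finished every flush except #0,
                // which stays wedged (like the stuck MemtableFlushWriter).
                if (flushId != 0)
                    flushed.countDown();

                postFlush.submit(() -> {
                    try
                    {
                        flushed.await(); // same shape as latch.await() in PostFlush.call()
                        System.out.println("post-flush " + flushId + " completed");
                    }
                    catch (InterruptedException e)
                    {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            TimeUnit.SECONDS.sleep(2);
            System.out.println("tasks 1-4 never ran: they are queued behind the stuck task 0");
            postFlush.shutdownNow();
        }
    }

No "completed" line is ever printed: tasks 1-4 sit behind task 0, which mirrors the MemtablePostFlush pool reporting 1 active task and a growing Pending count while Completed stays flat.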