Hi Jon,

That is interesting. We happen to be running Cassandra on ZFS. However we have not had any incident for years with this setup, the only change is the recent addition of CDC.

I can see that in CASSANDRA-19564, the MemtablePostFlush thread was stuck on the unlink() syscall. But in our case, it was stuck here:

"MemtablePostFlush:1" daemon prio=5 tid=237 WAITING
    at jdk.internal.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
    at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$RegisteredSignal#1     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)        local variable: org.apache.cassandra.utils.concurrent.CountDownLatch$Async#7     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1094)        local variable: org.apache.cassandra.db.ColumnFamilyStore$PostFlush#7     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1077)
    at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
    at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
       local variable: org.apache.cassandra.concurrent.FutureTask#2555
       local variable: org.apache.cassandra.concurrent.FutureTask#2555
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)        local variable: org.apache.cassandra.concurrent.SingleThreadExecutorPlus#5
       local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
       local variable: io.netty.util.concurrent.FastThreadLocalThread#11
       local variable: org.apache.cassandra.concurrent.FutureTask#2555
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       local variable: io.netty.util.concurrent.FastThreadLocalRunnable#11
    at java.lang.Thread.run(Thread.java:829)
       local variable: io.netty.util.concurrent.FastThreadLocalThread#11

Which made me believe that they are not the same issue.

Thanks for the info though.

BTW, if you still have access to that oddly behaving ZFS filesystem, have a look at the free space (should have at least 20% free) and the fragmentation ratio in "zpool list". If neither is an issue, check whether deduplication is enabled or there's a large number of snapshots. These can have significant impact on file deletion performance on ZFS. Also worth checking the disks, I have seen broken disks that stuck on some operations, e.g. when a specific sector is being read, and this will certainly affect the filesystem behaviour.

Cheers,
Bowen

On 05/11/2024 16:41, Jon Haddad wrote:
I ran into this a few months ago, and in my case I tracked it down to an issue with ZFS not unlinking commitlogs properly.

https://issues.apache.org/jira/browse/CASSANDRA-19564

On Tue, Nov 5, 2024 at 6:05 AM Dmitry Konstantinov <netud...@gmail.com> wrote:

    I am speaking about a thread dump (stack traces for all threads),
    not a heap dump. The heap dump should contain thread stacks info.
    Thread dump (stack traces) is small and does not have sensitive info.

    Regards,
    Dmitry

    On Tue, 5 Nov 2024 at 13:53, Bowen Song via user
    <user@cassandra.apache.org> wrote:

        It's about 18GB in size and may contain a huge amount of
        sensitive data (e.g. all the pending writes), so I can't share
        it. However, if there's any particular piece of information
        you would like to have, I'm more than happy to extract the
        info from the dump and and share it here.

        On 05/11/2024 13:01, Dmitry Konstantinov wrote:
        Hi Bowen, would it be possible to share a full thread dump?

        Regards,
        Dmitry

        On Tue, 5 Nov 2024 at 12:12, Bowen Song via user
        <user@cassandra.apache.org> wrote:

            Hi all,

            We have a cluster running Cassandra 4.1.1. We are seeing
            the memtable flush randomly getting stuck. This has
            happened twice in the last 10 days, to two different
            nodes in the same cluster. This started to happen after
            we enabled CDC, and each time it got stuck, there was at
            least one repair running involving the affected node.

            The signs of the stuck memtable flush is most obvious
            from the "StatusLogger" logs.

            At the beginning, the MemtablePostFlush and
            MemtableFlushWriter got stuck, they have 1 and 2 active
            tasks each, and a small number of pending tasks.

            INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:65 - Pool Name                      
            Active Pending      Completed   Blocked  All Time Blocked
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - ReadStage 0         0      
            34052333 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - CompactionExecutor 0        
            0        1019777 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - MutationStage 0         0      
            14930764 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - MemtableReclaimMemory 0        
            0          21877 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - PendingRangeCalculator 0        
            0            177 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - Repair#61 0         0          
            1344 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - GossipStage 0         0        
            889452 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - SecondaryIndexManagement 0        
            0              1 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - HintsDispatcher 0        
            0             19 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - Repair-Task 0        
            0             65 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - RequestResponseStage 0        
            0       44447834 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - Native-Transport-Requests
            0         0        8967921 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - MigrationStage 0        
            0              5 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - MemtableFlushWriter 2       
            10          21781 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - MemtablePostFlush 1       
            11          47856 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - PerDiskMemtableFlushWriter_0
            0         0          21769 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - Sampler 0         0              0
            0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - ValidationExecutor 0        
            0          36651 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - ViewBuildExecutor 0        
            0              0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - InternalResponseStage 0        
            0            240 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 -
            AntiEntropyStage                     1 1773        
            120067 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224
            StatusLogger.java:69 - CacheCleanupExecutor 0        
            0              0 0                 0

            The number of pending tasks slowly grows larger over
            time, and the number of completed tasks do not increase
            at all.

            INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:65 - Pool Name                      
            Active Pending      Completed   Blocked  All Time Blocked
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - ReadStage 0         0      
            39905462 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - CompactionExecutor 0        
            0        1170100 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - MutationStage 0         0      
            16976992 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#76 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#74 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#72 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - GossipStage 0         0       
            1026877 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair-Task 0        
            0             83 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - RequestResponseStage 0        
            0       51122191 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - ValidationExecutor 0        
            0          36651 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - MemtablePostFlush 1       
            54          47856 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - MemtableFlushWriter 2       
            53          21781 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - CacheCleanupExecutor 0        
            0              0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#67 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - MemtableReclaimMemory 0        
            0          21877 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - PendingRangeCalculator 0        
            0            177 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#82 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Repair#61 0         0          
            1344 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - SecondaryIndexManagement 0        
            0              1 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - HintsDispatcher 0        
            0             20 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Native-Transport-Requests
            0         0       10145020 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - MigrationStage 0        
            0              5 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - Sampler 0         0              0
            0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941
            StatusLogger.java:69 - PerDiskMemtableFlushWriter_0
            0         0          21769 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942
            StatusLogger.java:69 - ViewBuildExecutor 0        
            0              0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942
            StatusLogger.java:69 - InternalResponseStage 0        
            0            996 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942
            StatusLogger.java:69 -
            AntiEntropyStage                     1 9694        
            120067 0                 0

            Then, things start to get worse, the MutationStage also
            got stuck.

            INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:65 - Pool Name                      
            Active Pending      Completed   Blocked  All Time Blocked
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - ReadStage 0         0      
            40186426 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - CompactionExecutor 0        
            0        1171479 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - MutationStage                     
            128 201414       16988212 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#76 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#74 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#72 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - GossipStage 0         0       
            1032988 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair-Task 0        
            0             85 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - RequestResponseStage 0        
            0       51307497 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - ValidationExecutor 0        
            0          36651 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - MemtablePostFlush 1       
            54          47856 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - MemtableFlushWriter 2       
            53          21781 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - CacheCleanupExecutor 0        
            0              0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#67 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - MemtableReclaimMemory 0        
            0          21877 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#84 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - PendingRangeCalculator 0        
            0            177 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#82 0         0             
            0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Repair#61 0         0          
            1344 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - SecondaryIndexManagement 0        
            0              1 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - HintsDispatcher 0        
            0             20 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Native-Transport-Requests
            0         0       10194549 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - MigrationStage 0        
            0              5 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - Sampler 0         0              0
            0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - PerDiskMemtableFlushWriter_0
            0         0          21769 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - ViewBuildExecutor 0        
            0              0 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 - InternalResponseStage 0        
            0           1087 0                 0
            INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974
            StatusLogger.java:69 -
            AntiEntropyStage                     1 10364        
            120067 0                 0

            Writes on this node starts to timeout and fail. But if
            left untouched, it's only gonna get worse, and eventually
            lead to JVM OOM and crash.

            By inspecting the heap dump created at OOM, we can see
            that both of the MemtableFlushWriter threads are stuck on
            line 1190
            
<https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
            in the ColumnFamilyStore.java:

                        // mark writes older than the barrier as
            blocking progress, permitting them to exceed our memory limit
                        // if they are stuck waiting on it, then wait
            for them all to complete
                        writeBarrier.markBlocking();
                        writeBarrier.await();   // <----------- stuck
            here

            And the MemtablePostFlush thread is stuck on line 1094
            
<https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
            in the same file.

                        try
                        {
                            // we wait on the latch for the
            commitLogUpperBound to be set, and so that waiters
                            // on this task can rely on all prior
            flushes being complete
                            latch.await();   // <----------- stuck here
                        }
            Our top suspect is CDC interacting with repair, since
            this started to happen shortly after we enabled CDC on
            the nodes, and each time repair was running. But we have
            not been able to reproduce this in a testing cluster, and
            don't know what's the next step to troubleshoot this
            issue. So I'm posting it in the mailing lists and hoping
            someone may know something about it or point me to the
            right direction.

            p.s.: sorry about posting this to both the user & dev
            mailing lists. It's an end-user related issue but
            involves Cassandra internals, so I can't decide which one
            is best suited.

            Cheers,
            Bowen




-- Dmitry Konstantinov



-- Dmitry Konstantinov

Reply via email to