Funnily enough, we used to run ext4 and XFS on md RAID1, but the crappy disks we had (and still have) would randomly spit out garbage data every once in a while. We suspected a firmware bug, but were unable to confirm it or reliably reproduce it. Other than this behaviour, those disks work fine. Since md RAID1 can't tell which copy of the data is good, it has a 50% chance of overwriting the good data with the garbage, usually resulting in SSTable corruption. Since we switched to ZFS, we have never had an issue with that again. ZFS has saved countless hours of replacing nodes, and saved the cost of replacing all the crappy disks (we have many hundreds of them). So I'm fairly happy with that outcome. Do I wish we had better disks? Yes, definitely. But we don't, so ZFS will have to do.
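
To illustrate the difference (a toy sketch, nothing like the actual ZFS code): with a checksum stored alongside the data, picking the good mirror copy is trivial; without one, as with md RAID1, it's a coin flip.

import java.util.zip.CRC32;

// Toy illustration, not ZFS internals: a stored checksum identifies the
// good copy in a mirror; md RAID1 keeps no checksum, so it has to guess.
public class MirrorRepair
{
    static long crc(byte[] data)
    {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    public static void main(String[] args)
    {
        byte[] goodCopy = "expected block contents".getBytes();
        byte[] badCopy = "garbage from a bad disk".getBytes();
        long storedChecksum = crc(goodCopy); // ZFS keeps this in the parent block pointer

        // Read back both sides of the mirror; one of them returned garbage.
        byte[] repairSource = crc(badCopy) == storedChecksum ? badCopy : goodCopy;
        System.out.println("rewrite the bad copy from: " + new String(repairSource));
    }
}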

I will try to find out what is blocking the barrier, and will post updates here if I find anything.
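
In the meantime, the first thing I plan to look at is another thread dump, to see which thread owns whatever the flush threads are parked on. On a live node, jstack <pid> does the job; the same information can also be pulled programmatically, along these lines (just a sketch -- the class name and the WAITING filter are mine):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: print every WAITING thread, what it waits on, and who owns it.
// Run inside the target JVM (e.g. via a JMX connection or an agent).
public class BarrierSuspects
{
    public static void main(String[] args)
    {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo t : mx.dumpAllThreads(true, true))
        {
            if (t.getThreadState() != Thread.State.WAITING)
                continue;
            System.out.printf("%s waits on %s (owner: %s)%n",
                              t.getThreadName(), t.getLockName(), t.getLockOwnerName());
            for (StackTraceElement frame : t.getStackTrace())
                System.out.println("    at " + frame);
        }
    }
}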

On 05/11/2024 17:22, Jon Haddad wrote:
Yeah, I looked through your stack trace and saw it wasn't the same thing, but the steps to identify the root cause should be the same.

I nuked ZFS from orbit :) This was happening across all the machines in the cluster at various times, and we haven't seen a single issue since switching to XFS.

Thanks for the advice though, I'll keep it in mind if I encounter it again.

Jon

On Tue, Nov 5, 2024 at 9:18 AM Bowen Song via user <user@cassandra.apache.org> wrote:

    Hi Jon,

    That is interesting. We happen to be running Cassandra on ZFS.
    However, we have not had any incidents with this setup for years;
    the only recent change is the addition of CDC.

    I can see that in CASSANDRA-19564, the MemtablePostFlush thread
    was stuck on the unlink() syscall. But in our case, it was stuck here:

    "MemtablePostFlush:1" daemon prio=5 tid=237 WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        at
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
           local variable:
    org.apache.cassandra.utils.concurrent.WaitQueue$Standard$RegisteredSignal#1
        at
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
        at
    
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
           local variable:
    org.apache.cassandra.utils.concurrent.CountDownLatch$Async#7
        at
    
org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
        at
    
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1094)
           local variable:
    org.apache.cassandra.db.ColumnFamilyStore$PostFlush#7
        at
    
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1077)
        at
    org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at
    org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
           local variable: org.apache.cassandra.concurrent.FutureTask#2555
           local variable: org.apache.cassandra.concurrent.FutureTask#2555
        at
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
           local variable:
    org.apache.cassandra.concurrent.SingleThreadExecutorPlus#5
           local variable:
    java.util.concurrent.ThreadPoolExecutor$Worker#9
           local variable:
    io.netty.util.concurrent.FastThreadLocalThread#11
           local variable: org.apache.cassandra.concurrent.FutureTask#2555
        at
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
           local variable:
    java.util.concurrent.ThreadPoolExecutor$Worker#9
        at
    
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           local variable:
    io.netty.util.concurrent.FastThreadLocalRunnable#11
        at java.lang.Thread.run(Thread.java:829)
           local variable:
    io.netty.util.concurrent.FastThreadLocalThread#11

    Which made me believe that they are not the same issue.

    Thanks for the info though.

    BTW, if you still have access to that oddly behaving ZFS
    filesystem, have a look at the free space (it should have at
    least 20% free; see the CAP column in "zpool list") and the
    fragmentation ratio (the FRAG column in the same output). If
    neither is an issue, check whether deduplication is enabled (the
    DEDUP column) and whether there is a large number of snapshots
    ("zfs list -t snapshot"). All of these can have a significant
    impact on file deletion performance on ZFS. It is also worth
    checking the disks; I have seen broken disks that get stuck on
    some operations, e.g. when a specific sector is being read, and
    that will certainly affect the filesystem's behaviour.

    Cheers,
    Bowen

    On 05/11/2024 16:41, Jon Haddad wrote:
    I ran into this a few months ago, and in my case I tracked it
    down to an issue with ZFS not unlinking commitlogs properly.

    https://issues.apache.org/jira/browse/CASSANDRA-19564

    On Tue, Nov 5, 2024 at 6:05 AM Dmitry Konstantinov
    <netud...@gmail.com> wrote:

        I am speaking about a thread dump (stack traces for all
        threads), not a heap dump. The heap dump should contain the
        thread stack info as well.
        A thread dump (stack traces) is small and does not contain
        sensitive info.

        Regards,
        Dmitry

        On Tue, 5 Nov 2024 at 13:53, Bowen Song via user
        <user@cassandra.apache.org> wrote:

            It's about 18GB in size and may contain a huge amount of
            sensitive data (e.g. all the pending writes), so I can't
            share it. However, if there's any particular piece of
            information you would like to have, I'm more than happy
            to extract the info from the dump and share it here.

            On 05/11/2024 13:01, Dmitry Konstantinov wrote:
            Hi Bowen, would it be possible to share a full thread dump?

            Regards,
            Dmitry

            On Tue, 5 Nov 2024 at 12:12, Bowen Song via user
            <user@cassandra.apache.org> wrote:

                Hi all,

                We have a cluster running Cassandra 4.1.1. We are
                seeing the memtable flush randomly getting stuck.
                This has happened twice in the last 10 days, to two
                different nodes in the same cluster. This started to
                happen after we enabled CDC, and each time it got
                stuck, there was at least one repair running
                involving the affected node.

                The signs of the stuck memtable flush are most
                obvious in the "StatusLogger" logs.

                At the beginning, the MemtablePostFlush and
                MemtableFlushWriter pools got stuck, with 1 and 2
                active tasks respectively, and a small number of
                pending tasks.

                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:65 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ReadStage                         0         0       34052333         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CompactionExecutor                0         0        1019777         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MutationStage                     0         0       14930764         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableReclaimMemory             0         0          21877         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PendingRangeCalculator            0         0            177         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair#61                         0         0           1344         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - GossipStage                       0         0         889452         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - SecondaryIndexManagement          0         0              1         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - HintsDispatcher                   0         0             19         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair-Task                       0         0             65         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - RequestResponseStage              0         0       44447834         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Native-Transport-Requests         0         0        8967921         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MigrationStage                    0         0              5         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableFlushWriter               2        10          21781         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtablePostFlush                 1        11          47856         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0      0         0          21769         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Sampler                           0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ValidationExecutor                0         0          36651         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ViewBuildExecutor                 0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - InternalResponseStage             0         0            240         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - AntiEntropyStage                  1      1773         120067         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CacheCleanupExecutor              0         0              0         0                 0

                The number of pending tasks slowly grows over time,
                and the number of completed tasks does not increase
                at all.

                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:65 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ReadStage                         0         0       39905462         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CompactionExecutor                0         0        1170100         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MutationStage                     0         0       16976992         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#76                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#74                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#72                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - GossipStage                       0         0        1026877         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair-Task                       0         0             83         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - RequestResponseStage              0         0       51122191         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ValidationExecutor                0         0          36651         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtablePostFlush                 1        54          47856         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableFlushWriter               2        53          21781         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CacheCleanupExecutor              0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#67                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableReclaimMemory             0         0          21877         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PendingRangeCalculator            0         0            177         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#82                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#61                         0         0           1344         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - SecondaryIndexManagement          0         0              1         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - HintsDispatcher                   0         0             20         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Native-Transport-Requests         0         0       10145020         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MigrationStage                    0         0              5         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Sampler                           0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0      0         0          21769         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - ViewBuildExecutor                 0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - InternalResponseStage             0         0            996         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - AntiEntropyStage                  1      9694         120067         0                 0

                Then things started to get worse: the MutationStage
                also got stuck.

                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:65 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ReadStage                         0         0       40186426         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CompactionExecutor                0         0        1171479         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MutationStage                   128    201414       16988212         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#76                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#74                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#72                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - GossipStage                       0         0        1032988         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair-Task                       0         0             85         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - RequestResponseStage              0         0       51307497         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ValidationExecutor                0         0          36651         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtablePostFlush                 1        54          47856         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableFlushWriter               2        53          21781         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CacheCleanupExecutor              0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#67                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableReclaimMemory             0         0          21877         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#84                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PendingRangeCalculator            0         0            177         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#82                         0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#61                         0         0           1344         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - SecondaryIndexManagement          0         0              1         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - HintsDispatcher                   0         0             20         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Native-Transport-Requests         0         0       10194549         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MigrationStage                    0         0              5         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Sampler                           0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0      0         0          21769         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ViewBuildExecutor                 0         0              0         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - InternalResponseStage             0         0           1087         0                 0
                INFO  [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - AntiEntropyStage                  1     10364         120067         0                 0

                Writes on this node start to time out and fail. If
                left untouched, it only gets worse, eventually
                leading to a JVM OOM and crash.

                By inspecting the heap dump created at OOM, we can
                see that both of the MemtableFlushWriter threads are
                stuck on line 1190
                <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
                in ColumnFamilyStore.java:

                // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
                // if they are stuck waiting on it, then wait for them all to complete
                writeBarrier.markBlocking();
                writeBarrier.await(); // <----------- stuck here

                And the MemtablePostFlush thread is stuck on line 1094
                <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
                in the same file:

                try
                {
                    // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
                    // on this task can rely on all prior flushes being complete
                    latch.await(); // <----------- stuck here
                }
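
                The wedged state is easy to reproduce in miniature: a
                single-threaded executor whose head task waits on a
                latch that is never counted down shows exactly the "1
                active, pending growing, completed flat" pattern
                above. A minimal sketch with the plain JDK
                (illustrative only, not Cassandra's actual flush
                code; the names are mine):

                import java.util.concurrent.CountDownLatch;
                import java.util.concurrent.ExecutorService;
                import java.util.concurrent.Executors;

                // The head task waits on a latch nobody counts down, so every
                // later task queues behind it: "Pending" grows while
                // "Completed" stays flat.
                public class PostFlushPileUp
                {
                    public static void main(String[] args) throws InterruptedException
                    {
                        ExecutorService postFlush = Executors.newSingleThreadExecutor();
                        CountDownLatch latch = new CountDownLatch(1); // never counted down

                        postFlush.submit(() -> {
                            try
                            {
                                latch.await(); // stuck, like MemtablePostFlush:1 above
                            }
                            catch (InterruptedException e)
                            {
                                Thread.currentThread().interrupt();
                            }
                        });

                        for (int i = 0; i < 10; i++)
                            postFlush.submit(() -> {}); // these pile up as "Pending"

                        Thread.sleep(1000);
                        System.out.println("queued tasks never run; the executor is wedged");
                        postFlush.shutdownNow();
                    }
                }
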
                Our top suspect is CDC interacting with repair, since
                this started to happen shortly after we enabled CDC
                on the nodes, and each time a repair involving the
                affected node was running. But we have not been able
                to reproduce this in a testing cluster, and we don't
                know what the next step is to troubleshoot this
                issue. So I'm posting it to the mailing lists, hoping
                someone may know something about it or can point me
                in the right direction.

                p.s.: sorry about posting this to both the user & dev
                mailing lists. It's an end-user related issue but
                involves Cassandra internals, so I couldn't decide
                which one is better suited.

                Cheers,
                Bowen




-- Dmitry Konstantinov


