Yeah, I looked through your stack trace and saw it wasn't the same thing, but the steps to identify the root cause should be the same.
I nuked ZFS from orbit :) This was happening across all the machines in the cluster at various times, and we haven't seen a single issue since switching to XFS. Thanks for the advice though, I'll keep it in mind if I encounter it again.

Jon

On Tue, Nov 5, 2024 at 9:18 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> Hi Jon,
>
> That is interesting. We happen to be running Cassandra on ZFS. However, we have not had any incident for years with this setup; the only change is the recent addition of CDC.
>
> I can see that in CASSANDRA-19564, the MemtablePostFlush thread was stuck on the unlink() syscall. But in our case, it was stuck here:
>
> "MemtablePostFlush:1" daemon prio=5 tid=237 WAITING
>     at jdk.internal.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$RegisteredSignal#1
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
>        local variable: org.apache.cassandra.utils.concurrent.CountDownLatch$Async#7
>     at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
>     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1094)
>        local variable: org.apache.cassandra.db.ColumnFamilyStore$PostFlush#7
>     at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.call(ColumnFamilyStore.java:1077)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>        local variable: org.apache.cassandra.concurrent.SingleThreadExecutorPlus#5
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#11
>        local variable: org.apache.cassandra.concurrent.FutureTask#2555
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#9
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>        local variable: io.netty.util.concurrent.FastThreadLocalRunnable#11
>     at java.lang.Thread.run(Thread.java:829)
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#11
>
> This made me believe that they are not the same issue.
>
> Thanks for the info though.
>
> BTW, if you still have access to that oddly behaving ZFS filesystem, have a look at the free space (it should have at least 20% free) and the fragmentation ratio in "zpool list". If neither is an issue, check whether deduplication is enabled or there's a large number of snapshots. These can have a significant impact on file deletion performance on ZFS. It's also worth checking the disks: I have seen broken disks that get stuck on some operations, e.g. when a specific sector is being read, and that will certainly affect the filesystem's behaviour.
>
> Cheers,
> Bowen
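If slow file deletion is the suspect on a setup like this, one quick sanity check is to time unlink() directly on the affected mount. Here is a rough, hypothetical Java probe for that; the default directory and the file size below are just placeholders, adjust them to the real commitlog/CDC directory:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Rough probe for slow unlink() on a suspect mount. The default path is a
    // placeholder - point it at the actual commitlog or CDC directory.
    public class UnlinkLatencyProbe
    {
        public static void main(String[] args) throws IOException
        {
            Path dir = Paths.get(args.length > 0 ? args[0] : "/var/lib/cassandra/commitlog");
            ByteBuffer chunk = ByteBuffer.allocate(1 << 20); // 1 MiB of zeros

            for (int i = 0; i < 10; i++)
            {
                Path file = Files.createTempFile(dir, "unlink-probe-", ".tmp");
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE))
                {
                    for (int j = 0; j < 32; j++) // ~32 MiB, roughly one commitlog segment
                    {
                        chunk.rewind();
                        ch.write(chunk);
                    }
                    ch.force(true); // make sure the data hits the disk before deleting
                }

                long start = System.nanoTime();
                Files.delete(file); // the unlink we want to time
                System.out.printf("delete #%d took %d us%n", i, (System.nanoTime() - start) / 1_000);
            }
        }
    }

Consistently slow or occasionally multi-second deletes here would point at the filesystem or the disks rather than at Cassandra.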
> On 05/11/2024 16:41, Jon Haddad wrote:
>
> I ran into this a few months ago, and in my case I tracked it down to an issue with ZFS not unlinking commitlogs properly.
>
> https://issues.apache.org/jira/browse/CASSANDRA-19564
>
> On Tue, Nov 5, 2024 at 6:05 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>
>> I am speaking about a thread dump (stack traces for all threads), not a heap dump. The heap dump should contain the thread stack info.
>> A thread dump (stack traces) is small and does not have sensitive info.
>>
>> Regards,
>> Dmitry
>>
>> On Tue, 5 Nov 2024 at 13:53, Bowen Song via user <user@cassandra.apache.org> wrote:
>>
>>> It's about 18GB in size and may contain a huge amount of sensitive data (e.g. all the pending writes), so I can't share it. However, if there's any particular piece of information you would like to have, I'm more than happy to extract the info from the dump and share it here.
>>>
>>> On 05/11/2024 13:01, Dmitry Konstantinov wrote:
>>>
>>> Hi Bowen, would it be possible to share a full thread dump?
>>>
>>> Regards,
>>> Dmitry
>>>
>>> On Tue, 5 Nov 2024 at 12:12, Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have a cluster running Cassandra 4.1.1. We are seeing the memtable flush randomly getting stuck. This has happened twice in the last 10 days, to two different nodes in the same cluster. It started to happen after we enabled CDC, and each time it got stuck there was at least one repair running involving the affected node.
>>>>
>>>> The signs of the stuck memtable flush are most obvious in the "StatusLogger" logs.
>>>>
>>>> At the beginning, the MemtablePostFlush and MemtableFlushWriter pools got stuck; they have 1 and 2 active tasks respectively, and a small number of pending tasks.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ReadStage                           0        0   34052333        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CompactionExecutor                  0        0    1019777        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MutationStage                       0        0   14930764        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - GossipStage                         0        0     889452        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - HintsDispatcher                     0        0         19        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Repair-Task                         0        0         65        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - RequestResponseStage                0        0   44447834        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Native-Transport-Requests           0        0    8967921        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtableFlushWriter                 2       10      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - MemtablePostFlush                   1       11      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - InternalResponseStage               0        0        240        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - AntiEntropyStage                    1     1773     120067        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 03:50:15,224 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>>
>>>> The number of pending tasks slowly grows larger over time, and the number of completed tasks does not increase at all.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ReadStage                           0        0   39905462        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CompactionExecutor                  0        0    1170100        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MutationStage                       0        0   16976992        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#76                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#74                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#72                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - GossipStage                         0        0    1026877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair-Task                         0        0         83        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - RequestResponseStage                0        0   51122191        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtablePostFlush                   1       54      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableFlushWriter                 2       53      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#67                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#82                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - HintsDispatcher                     0        0         20        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Native-Transport-Requests           0        0   10145020        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,941 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - InternalResponseStage               0        0        996        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 16:33:05,942 StatusLogger.java:69 - AntiEntropyStage                    1     9694     120067        0                 0
>>>>
>>>> Then things start to get worse: the MutationStage also gets stuck.
>>>>
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:65 - Pool Name                      Active  Pending  Completed  Blocked  All Time Blocked
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ReadStage                           0        0   40186426        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CompactionExecutor                  0        0    1171479        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MutationStage                     128   201414   16988212        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#76                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#74                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#72                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - GossipStage                         0        0    1032988        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair-Task                         0        0         85        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - RequestResponseStage                0        0   51307497        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ValidationExecutor                  0        0      36651        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtablePostFlush                   1       54      47856        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableFlushWriter                 2       53      21781        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - CacheCleanupExecutor                0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#67                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MemtableReclaimMemory               0        0      21877        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#84                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PendingRangeCalculator              0        0        177        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#82                           0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Repair#61                           0        0       1344        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - SecondaryIndexManagement            0        0          1        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - HintsDispatcher                     0        0         20        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Native-Transport-Requests           0        0   10194549        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - MigrationStage                      0        0          5        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - Sampler                             0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - PerDiskMemtableFlushWriter_0        0        0      21769        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - ViewBuildExecutor                   0        0          0        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - InternalResponseStage               0        0       1087        0                 0
>>>> INFO [ScheduledTasks:1] 2024-11-03 17:05:40,974 StatusLogger.java:69 - AntiEntropyStage                    1    10364     120067        0                 0
>>>>
>>>> Writes on this node start to time out and fail. If left untouched, it only gets worse, and eventually leads to a JVM OOM and crash.
>>>>
>>>> By inspecting the heap dump created at OOM, we can see that both of the MemtableFlushWriter threads are stuck on line 1190 <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190> in ColumnFamilyStore.java:
>>>>
>>>>             // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
>>>>             // if they are stuck waiting on it, then wait for them all to complete
>>>>             writeBarrier.markBlocking();
>>>>             writeBarrier.await(); // <----------- stuck here
>>>>
>>>> And the MemtablePostFlush thread is stuck on line 1094 <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094> in the same file:
>>>>
>>>>             try
>>>>             {
>>>>                 // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
>>>>                 // on this task can rely on all prior flushes being complete
>>>>                 latch.await(); // <----------- stuck here
>>>>             }
>>>>
>>>> Our top suspect is CDC interacting with repair, since this started to happen shortly after we enabled CDC on the nodes, and each time it happened a repair was running. But we have not been able to reproduce this in a testing cluster, and we don't know what the next step to troubleshoot this issue should be. So I'm posting it to the mailing lists, hoping someone may know something about it or point me in the right direction.
>>>>
>>>> p.s.: sorry about posting this to both the user & dev mailing lists. It's an end-user related issue, but it involves Cassandra internals, so I couldn't decide which one is better suited.
>>>>
>>>> Cheers,
>>>> Bowen
>>>>
>>>
>>> --
>>> Dmitry Konstantinov
>>>
>>
>> --
>> Dmitry Konstantinov
>>
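To make the failure mode described in the quoted report easier to picture: post-flush tasks run on a single thread, and each one waits on a latch that is only released when its own flush completes, so a single wedged flush leaves every later task pending. Below is a toy, self-contained Java sketch of that pattern; it is an illustration only, not Cassandra's actual code, and the class and variable names are made up:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Toy model of the MemtablePostFlush queue: one worker thread, and each
    // task blocks on a latch that is only released when its flush finishes.
    public class PostFlushStall
    {
        public static void main(String[] args) throws InterruptedException
        {
            ExecutorService postFlush = Executors.newSingleThreadExecutor();

            for (int i = 0; i < 5; i++)
            {
                final int flushId = i;
                final CountDownLatch flushed = new CountDownLatch(1);

                // Pretend the flush writer finished every flush except #0,
                // which stays wedged (like the stuck MemtableFlushWriter).
                if (flushId != 0)
                    flushed.countDown();

                postFlush.submit(() -> {
                    try
                    {
                        flushed.await(); // same shape as latch.await() in PostFlush.call()
                        System.out.println("post-flush " + flushId + " completed");
                    }
                    catch (InterruptedException e)
                    {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            TimeUnit.SECONDS.sleep(2);
            System.out.println("tasks 1-4 never ran: they are queued behind the stuck task 0");
            postFlush.shutdownNow();
        }
    }

No "completed" line is ever printed: tasks 1-4 sit behind task 0, which mirrors the MemtablePostFlush pool reporting 1 active task and a growing Pending count while Completed stays flat.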