We have also seen this issue a few times in our production clusters (4.1). My teammates added a thread dump here: <https://issues.apache.org/jira/browse/CASSANDRA-19564?focusedCommentId=17893987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17893987>.
One of my theories is that the MemtableReclaimMemory thread has been unable to make progress:

"MemtableReclaimMemory:1" - Thread t@56
   java.lang.Thread.State: WAITING
        at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
        at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
        at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
        at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
        at app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267)
        at app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425)
        at app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357)
        at app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
        at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)

One possible bug I could see is on this line <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadExecutionController.java#L141C77-L141C97>:

    indexController = new ReadExecutionController(command, indexCfs.readOrdering.start(),
            indexCfs.metadata(), null, null, NO_SAMPLING, false);

If indexCfs.readOrdering.start() succeeds but the constructor new ReadExecutionController(...) then throws, we never close the group we just started on indexCfs.readOrdering, which means its accounting will remain wrong forever: the group is never closed, so any later barrier issued on that OpOrder will wait on it indefinitely.

Jaydeep
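P.S. If that theory is right, the defensive pattern would look something like the sketch below (untested, for illustration only; the constructor arguments are copied from the line quoted above). The key point is to keep the group in a local variable so it can be closed if anything after start() throws, mirroring the try/catch that CassandraKeyspaceWriteHandler.beginWrite() (quoted further down in this thread) already uses for Keyspace.writeOrder:

    // Untested sketch: close the group if the constructor throws, so the
    // OpOrder is not left with a group that can never be closed.
    OpOrder.Group indexReadGroup = indexCfs.readOrdering.start();
    try
    {
        indexController = new ReadExecutionController(command, indexReadGroup,
                indexCfs.metadata(), null, null, NO_SAMPLING, false);
    }
    catch (Throwable t)
    {
        indexReadGroup.close(); // otherwise a flush barrier waits on this group forever
        throw t;
    }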
On Wed, Nov 6, 2024 at 5:51 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> I think I'm getting really close now. This seems to have something to do
> with the "read-hotness-tracker:1" thread. The thread dump is:
>
> "read-hotness-tracker:1" daemon prio=5 tid=93 WAITING
>     at jdk.internal.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>     at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>     at org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
>     at org.apache.cassandra.db.commitlog.AbstractCommitLogService.awaitSyncAt(AbstractCommitLogService.java:324)   <------ here
>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>        local variable: com.codahale.metrics.Timer$Context#2086795
>     at org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:42)
>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>     at org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:284)
>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>     at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:307)
>        local variable: org.apache.cassandra.db.commitlog.CommitLog#1
>        local variable: org.apache.cassandra.io.util.DataOutputBuffer$1$1#61
>        local variable: org.apache.cassandra.db.commitlog.CommitLogSegment$Allocation#1
>     at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.addToCommitLog(CassandraKeyspaceWriteHandler.java:100)
>     at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:54)   <------ and here
>        local variable: org.apache.cassandra.utils.concurrent.OpOrder$Group#8162
>     at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:628)
>        local variable: org.apache.cassandra.db.Keyspace#8
>        local variable: org.apache.cassandra.db.Mutation#54491
>     at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:512)
>     at org.apache.cassandra.db.Mutation.apply(Mutation.java:228)
>     at org.apache.cassandra.db.Mutation.apply(Mutation.java:248)
>     at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:675)
>     at org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:666)
>     at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:447)
>     at org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1488)
>        local variable: java.lang.String#42806
>        local variable: java.lang.String#47659
>        local variable: org.apache.cassandra.io.sstable.UUIDBasedSSTableId#1389
>        local variable: org.apache.cassandra.metrics.RestorableMeter#752
>     at org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2170)
>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>        local variable: org.apache.cassandra.concurrent.ExecutionFailure$1#122029
>        local variable: org.apache.cassandra.utils.WithResources$None$$Lambda$223#1
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>        local variable: java.util.concurrent.Executors$RunnableAdapter#934
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>        local variable: org.apache.cassandra.concurrent.ScheduledThreadPoolExecutorPlus#6
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#7
>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>        local variable: io.netty.util.concurrent.FastThreadLocalRunnable#7
>     at java.lang.Thread.run(Thread.java:829)
>        local variable: io.netty.util.concurrent.FastThreadLocalThread#7
>
> In the CassandraKeyspaceWriteHandler.beginWrite() method:
>
>     public WriteContext beginWrite(Mutation mutation, boolean makeDurable) throws RequestExecutionException
>     {
>         OpOrder.Group group = null;
>         try
>         {
>             group = Keyspace.writeOrder.start(); // <--- here
>
>             // write the mutation to the commitlog and memtables
>             CommitLogPosition position = null;
>             if (makeDurable)
>             {
>                 position = addToCommitLog(mutation); // <--- and here
>             }
>             return new CassandraWriteContext(group, position);
>         }
>         catch (Throwable t)
>         {
>             if (group != null)
>             {
>                 group.close();
>             }
>             throw t;
>         }
>     }
>
> The group opened by Keyspace.writeOrder.start() is what blocks the
> MemtableFlushWriter thread (the flush barrier cannot advance past an open
> group), and addToCommitLog(mutation) then indirectly calls
> AbstractCommitLogService.awaitSyncAt():
>
>     void awaitSyncAt(long syncTime, Context context)
>     {
>         do
>         {
>             WaitQueue.Signal signal = context != null ? syncComplete.register(context, Context::stop) : syncComplete.register();
>             if (lastSyncedAt < syncTime)
>                 signal.awaitUninterruptibly(); // <--- here
>             else
>                 signal.cancel();
>         }
>         while (lastSyncedAt < syncTime);
>     }
>
> The thread then got stuck on the signal.awaitUninterruptibly() call.
>
> Now I know what is blocking the memtable flushing, but I haven't been able
> to figure out why it got stuck waiting for that signal.
>
> I would appreciate it if anyone can offer some insight here.
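The interaction described above can be reduced to a small self-contained demo. Below is a minimal sketch using java.util.concurrent.Phaser as a hypothetical stand-in for Cassandra's OpOrder (simplified; not the real classes): the flush "barrier" cannot advance until every write group started before it has been closed, and here the group is held by a thread that is itself blocked waiting for a commit log sync signal.

    import java.util.concurrent.Phaser;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical stand-in for OpOrder: writers register before mutating,
    // and a flush "barrier" waits for all registered writers to arrive.
    public class FlushBarrierDemo
    {
        public static void main(String[] args) throws InterruptedException
        {
            Phaser writeOrder = new Phaser(1); // party 0 is the flush thread itself

            // Writer: opens a "group" (like Keyspace.writeOrder.start()) and then
            // blocks forever, like awaitSyncAt() waiting on a signal that never fires.
            Thread writer = new Thread(() -> {
                writeOrder.register();
                try
                {
                    Thread.sleep(Long.MAX_VALUE); // stuck inside the write path
                }
                catch (InterruptedException ignored)
                {
                }
                finally
                {
                    writeOrder.arriveAndDeregister(); // the group.close() that never happens
                }
            });
            writer.setDaemon(true);
            writer.start();
            Thread.sleep(100); // give the writer time to register

            // Flush: issue a barrier and wait for all groups opened before it.
            int phase = writeOrder.arrive();
            try
            {
                writeOrder.awaitAdvanceInterruptibly(phase, 2, TimeUnit.SECONDS);
                System.out.println("flush: barrier passed");
            }
            catch (TimeoutException e)
            {
                // The analogue of MemtableFlushWriter parked in OpOrder$Barrier.await().
                System.out.println("flush: still blocked behind an open write group");
            }
        }
    }

Run as-is, the flush side times out after two seconds, which is the analogue of the flush threads being parked indefinitely in the stack traces above.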
> On 05/11/2024 17:48, Bowen Song via user wrote:
>
> I will give it a try and see what I can find. I plan to go down the rabbit
> hole tomorrow. Will keep you updated.
>
> On 05/11/2024 17:34, Jeff Jirsa wrote:
>
> On Nov 5, 2024, at 4:12 AM, Bowen Song via user <user@cassandra.apache.org> wrote:
>
> > Writes on this node start to time out and fail. But if left untouched,
> > it's only gonna get worse, and eventually lead to JVM OOM and crash.
> >
> > By inspecting the heap dump created at OOM, we can see that both of the
> > MemtableFlushWriter threads are stuck on line 1190
> > <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
> > in ColumnFamilyStore.java:
> >
> >     // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
> >     // if they are stuck waiting on it, then wait for them all to complete
> >     writeBarrier.markBlocking();
> >     writeBarrier.await(); // <----------- stuck here
> >
> > And the MemtablePostFlush thread is stuck on line 1094
> > <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
> > in the same file:
> >
> >     try
> >     {
> >         // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
> >         // on this task can rely on all prior flushes being complete
> >         latch.await(); // <----------- stuck here
> >     }
> >
> > Our top suspect is CDC interacting with repair, since this started to
> > happen shortly after we enabled CDC on the nodes, and each time repair was
> > running. But we have not been able to reproduce this in a testing cluster,
> > and we don't know what the next step is to troubleshoot this issue. So I'm
> > posting it in the mailing lists, hoping someone may know something about
> > it or can point me in the right direction.
>
> Wouldn't be completely surprised if CDC or repair somehow holds a barrier;
> I've also seen similar behaviour pre-3.0 with "very long running read
> commands" that held a barrier on the memtable and prevented its release.
>
> You've got the heap (great, way better than most people debugging). Are
> you able to navigate through it and look for references to that memtable
> or other things holding a barrier?
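Tying the stack traces in this thread together, the reported state is a chain of waits: a writer holds an open write group while parked in awaitSyncAt(), MemtableFlushWriter is parked behind the write barrier, and MemtablePostFlush is parked on the commitLogUpperBound latch. A minimal sketch of that stuck state (hypothetical and heavily simplified, with CountDownLatch standing in for both the open group and the post-flush latch; not Cassandra's actual flush code):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Neither latch is ever counted down, because the thread that would do so
    // is itself parked in awaitSyncAt() waiting for a commit log sync signal.
    public class StuckFlushDemo
    {
        public static void main(String[] args) throws InterruptedException
        {
            CountDownLatch writeGroupClosed = new CountDownLatch(1);       // the open write group
            CountDownLatch commitLogUpperBoundSet = new CountDownLatch(1); // the post-flush latch

            Thread flushWriter = new Thread(() -> {
                try
                {
                    // like writeBarrier.markBlocking(); writeBarrier.await();
                    if (!writeGroupClosed.await(2, TimeUnit.SECONDS))
                        System.out.println("MemtableFlushWriter: stuck on writeBarrier.await()");
                }
                catch (InterruptedException ignored) { }
            });

            Thread postFlush = new Thread(() -> {
                try
                {
                    // like latch.await() waiting for the commitLogUpperBound to be set
                    if (!commitLogUpperBoundSet.await(2, TimeUnit.SECONDS))
                        System.out.println("MemtablePostFlush: stuck on latch.await()");
                }
                catch (InterruptedException ignored) { }
            });

            flushWriter.start();
            postFlush.start();
            flushWriter.join();
            postFlush.join();
        }
    }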