I think this is the correct explanation. It's very similar to CASSANDRA-19576, where compaction is unable to finish because we can't insert into compaction history.
Really good analysis, Jaydeep. Jon On Sun, Nov 10, 2024 at 1:51 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote: > Upon further studying the thread dump, I think there is a deadlock between > CompactionExecutor (Thread1) & MemtableReclaimMemory (Thread2) in the stack > trace I have mentioned in my email below. > > Please find the deadlock details here: > > 1. *Thread1:* It has invoked *readOrdering.start()* on *baseCf* and > *indexCf* as part of the "*ReadExecutionController controller = > cmd.executionController()*" at a line no > > <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L921> > 2. *Thread1:* It wants to apply the change locally to Memtable as part > of indexer.insertRow(row) at a line no > > <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L962> > 3. *Thread1:* There is not enough space available in Memtable; hence, > Thread1 is blocked on space to become available > 4. *Thread2:* It wants to free up the space in Memtable, but before it > does that, it is waiting > > <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1350> > on all the *readOrdering*for a given Cf to finish. > 5. *Thread2**:* But Thread2's indefinite wait cannot succeed because > Thread1 has already invoked "*.start()*" on it, so unless and until > Thread1 calls *".close()"* Threa2 cannot proceed. And here is the > *deadlock*. > > Here are the stack traces for Thread1 and Thread2 to prove the deadlock, > as mentioned above. > > Thread1 (*CompactionExecutor*) > > "CompactionExecutor:5" - Thread t@118 > java.lang.Thread.State: WAITING > at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) > at > app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) > at > app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) > at > app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131) > at > app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235) > at > app//org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:195) > at > app//org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89) > at > app//org.apache.cassandra.utils.memory.MemtableBufferAllocator$1.allocate(MemtableBufferAllocator.java:40) > at > app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:77) > at > app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:63) > at app//org.apache.cassandra.db.Clustering.clone(Clustering.java:54) > at > app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:52) > at > app//org.apache.cassandra.db.rows.BTreeRow.clone(BTreeRow.java:517) at > app//org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:375) > at > app//org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:351) > at > app//org.apache.cassandra.utils.btree.BTree.updateLeaves(BTree.java:461) > at app//org.apache.cassandra.utils.btree.BTree.update(BTree.java:368) > at > app//org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDeltaInternal(AtomicBTreePartition.java:146) > at > app//org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:190) > at > app//org.apache.cassandra.db.memtable.SkipListMemtable.put(SkipListMemtable.java:134) > at > app//org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1431) > at > app//org.apache.cassandra.db.CassandraTableWriteHandler.write(CassandraTableWriteHandler.java:40) > at > app//org.apache.cassandra.index.internal.CassandraIndex.insert(CassandraIndex.java:529) > at > app//org.apache.cassandra.index.internal.CassandraIndex$1.indexCell(CassandraIndex.java:445) > at > app//org.apache.cassandra.index.internal.CassandraIndex$1.indexCells(CassandraIndex.java:437) > at > app//org.apache.cassandra.index.internal.CassandraIndex$1.insertRow(CassandraIndex.java:387) > at > app//org.apache.cassandra.index.SecondaryIndexManager.lambda$indexPartition$25(SecondaryIndexManager.java:985) > at > app//org.apache.cassandra.index.SecondaryIndexManager$$Lambda$1396/0x00007f00861e2cb0.accept(Unknown > Source) at > java.base@11.0.20.1/java.lang.Iterable.forEach(Iterable.java:75) > at > app//org.apache.cassandra.index.SecondaryIndexManager.indexPartition(SecondaryIndexManager.java:985) > at > app//org.apache.cassandra.index.internal.CollatedViewIndexBuilder.build(CollatedViewIndexBuilder.java:78) > at > app//org.apache.cassandra.db.compaction.CompactionManager$13.run(CompactionManager.java:1888) > at > app//org.apache.cassandra.concurrent.FutureTask$3.call(FutureTask.java:141) > at > app//org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) > at app//org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) > at > java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at > app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829) > > > > Thread2 (*MemtableReclaimMemory*) > > "MemtableReclaimMemory:1" - Thread t@56 > java.lang.Thread.State: WAITING > at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) > at > app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) > at > app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) > at > app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186) > at > app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259) > at > app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267) > at > app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425) > at > app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357) > at > app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133) > at > java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at > app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829) > > Jaydeep > > On Wed, Nov 6, 2024 at 8:29 AM Jaydeep Chovatia < > chovatia.jayd...@gmail.com> wrote: > >> We have also seen this issue a few times in our production (4.1). My >> teammates added a thread dump here >> <https://issues.apache.org/jira/browse/CASSANDRA-19564?focusedCommentId=17893987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17893987> >> . >> >> One of my theories is that the MemtableReclaimMemory thread has been >> unable to make progress: >> >> "MemtableReclaimMemory:1" - Thread t@56 >> >> java.lang.Thread.State: WAITING >> >> at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native >> Method) >> >> at >> java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) >> >> at >> app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) >> >> at >> app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) >> >> at >> app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186) >> >> at >> app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259) >> >> at >> app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267) >> >> at >> app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425) >> >> at >> app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357) >> >> at >> app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) >> >> at >> app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133) >> >> at >> java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) >> >> at >> java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) >> >> at >> app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) >> >> at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829) >> >> >> >> One of the possible bugs I could see is a line >> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadExecutionController.java#L141C77-L141C97> >> * indexController = new ReadExecutionController(command, >> indexCfs.readOrdering.start(), indexCfs.metadata(), null, null, >> NO_SAMPLING, false);* >> If "*indexCfs.readOrdering.start()*" succeeded but the constructor "*new >> ReadExecutionController*", then we are not closing " >> *indexCfs.readOrdering*", which means it will remain >> inaccurate forever. >> >> >> Jaydeep >> >> On Wed, Nov 6, 2024 at 5:51 AM Bowen Song via user < >> user@cassandra.apache.org> wrote: >> >>> I think I'm getting really close now. This seems to have something to do >>> with the "read-hotness-tracker:1" thread. The thread dump is: >>> >>> "read-hotness-tracker:1" daemon prio=5 tid=93 WAITING >>> at jdk.internal.misc.Unsafe.park(Native Method) >>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) >>> at >>> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) >>> local variable: >>> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139 >>> at >>> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) >>> local variable: >>> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139 >>> at >>> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186) >>> local variable: >>> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139 >>> at >>> org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259) >>> at >>> org.apache.cassandra.db.commitlog.AbstractCommitLogService.awaitSyncAt(AbstractCommitLogService.java:324) >>> <------ here >>> local variable: >>> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1 >>> local variable: com.codahale.metrics.Timer$Context#2086795 >>> at >>> org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:42) >>> local variable: >>> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1 >>> at >>> org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:284) >>> local variable: >>> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1 >>> at >>> org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:307) >>> local variable: org.apache.cassandra.db.commitlog.CommitLog#1 >>> local variable: >>> org.apache.cassandra.io.util.DataOutputBuffer$1$1#61 >>> local variable: >>> org.apache.cassandra.db.commitlog.CommitLogSegment$Allocation#1 >>> at >>> org.apache.cassandra.db.CassandraKeyspaceWriteHandler.addToCommitLog(CassandraKeyspaceWriteHandler.java:100) >>> at >>> org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:54) >>> <------ and here >>> local variable: >>> org.apache.cassandra.utils.concurrent.OpOrder$Group#8162 >>> at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:628) >>> local variable: org.apache.cassandra.db.Keyspace#8 >>> local variable: org.apache.cassandra.db.Mutation#54491 >>> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:512) >>> at org.apache.cassandra.db.Mutation.apply(Mutation.java:228) >>> at org.apache.cassandra.db.Mutation.apply(Mutation.java:248) >>> at >>> org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:675) >>> at >>> org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:666) >>> at >>> org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:447) >>> at >>> org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1488) >>> local variable: java.lang.String#42806 >>> local variable: java.lang.String#47659 >>> local variable: >>> org.apache.cassandra.io.sstable.UUIDBasedSSTableId#1389 >>> local variable: org.apache.cassandra.metrics.RestorableMeter#752 >>> at >>> org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2170) >>> at >>> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124) >>> local variable: >>> org.apache.cassandra.concurrent.ExecutionFailure$1#122029 >>> local variable: >>> org.apache.cassandra.utils.WithResources$None$$Lambda$223#1 >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) >>> local variable: java.util.concurrent.Executors$RunnableAdapter#934 >>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) >>> local variable: >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931 >>> at >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) >>> local variable: >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931 >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) >>> local variable: >>> org.apache.cassandra.concurrent.ScheduledThreadPoolExecutorPlus#6 >>> local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7 >>> local variable: io.netty.util.concurrent.FastThreadLocalThread#7 >>> local variable: >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931 >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) >>> local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7 >>> at >>> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) >>> local variable: io.netty.util.concurrent.FastThreadLocalRunnable#7 >>> at java.lang.Thread.run(Thread.java:829) >>> local variable: io.netty.util.concurrent.FastThreadLocalThread#7 >>> >>> >>> In the CassandraKeyspaceWriteHandler.beginWrite() method: >>> >>> public WriteContext beginWrite(Mutation mutation, boolean makeDurable) >>> throws RequestExecutionException >>> { >>> OpOrder.Group group = null; >>> try >>> { >>> group = Keyspace.writeOrder.start(); // <--- here >>> >>> // write the mutation to the commitlog and memtables >>> CommitLogPosition position = null; >>> if (makeDurable) >>> { >>> position = addToCommitLog(mutation); // <--- and here >>> } >>> return new CassandraWriteContext(group, position); >>> } >>> catch (Throwable t) >>> { >>> if (group != null) >>> { >>> group.close(); >>> } >>> throw t; >>> } >>> } >>> >>> The Keyspace.writeOrder.start() method call blocks the >>> MemtableFlushWriter thread, and then it calls addToCommitLog(mutation) which >>> indirectly calls AbstractCommitLogService.awaitSyncAt() >>> >>> void awaitSyncAt(long syncTime, Context context) >>> { >>> do >>> { >>> WaitQueue.Signal signal = context != null ? >>> syncComplete.register(context, Context::stop) : syncComplete.register(); >>> if (lastSyncedAt < syncTime) >>> signal.awaitUninterruptibly(); // <--- here >>> else >>> signal.cancel(); >>> } >>> while (lastSyncedAt < syncTime); >>> } >>> >>> This then got stuck on the signal.awaitUninterruptibly() >>> >>> Now I know what is blocking the memtable flushing, but I haven't been >>> able to figure out is why it got stuck on waiting for that signal. >>> >>> I would appreciate it if anyone can offer some insight here. >>> >>> >>> On 05/11/2024 17:48, Bowen Song via user wrote: >>> >>> I will give it a try and see what I can find. I plan to go down the >>> rabbit hole tomorrow. Will keep you updated. >>> On 05/11/2024 17:34, Jeff Jirsa wrote: >>> >>> >>> >>> On Nov 5, 2024, at 4:12 AM, Bowen Song via user >>> <user@cassandra.apache.org> <user@cassandra.apache.org> wrote: >>> >>> Writes on this node starts to timeout and fail. But if left untouched, >>> it's only gonna get worse, and eventually lead to JVM OOM and crash. >>> >>> By inspecting the heap dump created at OOM, we can see that both of the >>> MemtableFlushWriter threads are stuck on line 1190 >>> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190> >>> in the ColumnFamilyStore.java: >>> >>> // mark writes older than the barrier as blocking progress, >>> permitting them to exceed our memory limit >>> // if they are stuck waiting on it, then wait for them all >>> to complete >>> writeBarrier.markBlocking(); >>> writeBarrier.await(); // <----------- stuck here >>> >>> And the MemtablePostFlush thread is stuck on line 1094 >>> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094> >>> in the same file. >>> >>> try >>> { >>> // we wait on the latch for the commitLogUpperBound to >>> be set, and so that waiters >>> // on this task can rely on all prior flushes being >>> complete >>> latch.await(); // <----------- stuck here >>> } >>> Our top suspect is CDC interacting with repair, since this started to >>> happen shortly after we enabled CDC on the nodes, and each time repair was >>> running. But we have not been able to reproduce this in a testing cluster, >>> and don't know what's the next step to troubleshoot this issue. So I'm >>> posting it in the mailing lists and hoping someone may know something about >>> it or point me to the right direction. >>> >>> >>> Wouldn’t be completely surprised if CDC or repair somehow has a >>> barrier, I’ve also seen similar behavior pre-3.0 with “very long running >>> read commands” that have a barrier on the memtable that prevent release. >>> >>> You’ve got the heap (great, way better than most people debugging), are >>> you able to navigate through it and look for references to that memtable or >>> other things holding a barrier? >>> >>> >>> >>> >>>