After studying the thread dump further, I believe there is a deadlock between CompactionExecutor (Thread1) and MemtableReclaimMemory (Thread2) in the stack trace mentioned in my email below.
Please find the deadlock details here:

1. *Thread1:* It has invoked *readOrdering.start()* on *baseCf* and *indexCf* as part of
   "*ReadExecutionController controller = cmd.executionController()*" at this line:
   <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L921>
2. *Thread1:* It wants to apply the change locally to the Memtable as part of indexer.insertRow(row) at this line:
   <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L962>
3. *Thread1:* There is not enough space available in the Memtable; hence, Thread1 is blocked waiting for space to become available.
4. *Thread2:* It wants to free up space in the Memtable, but before it does that, it waits here
   <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1350>
   for all the *readOrdering* operations on the given Cf to finish.
5. *Thread2:* Thread2's indefinite wait cannot succeed because Thread1 has already invoked "*.start()*" on that ordering, so unless and until Thread1 calls "*.close()*", Thread2 cannot proceed. And here is the *deadlock*.
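To make the circular wait easier to see, here is a minimal, self-contained Java sketch of the same shape. It does not use any Cassandra classes; the class and latch names are hypothetical stand-ins for the readOrdering group and the memtable allocator. Running it simply hangs on join(), which is exactly the cycle described in steps 1-5 above.

import java.util.concurrent.CountDownLatch;

public class ReadOrderingDeadlockSketch
{
    // Stand-in for "Thread1 has called readOrdering.close()".
    static final CountDownLatch readOrderingClosed = new CountDownLatch(1);
    // Stand-in for "Thread2 has reclaimed memtable space".
    static final CountDownLatch memtableSpaceAvailable = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException
    {
        Thread thread1 = new Thread(() -> {
            // Step 1: readOrdering.start() has been invoked (the group is now open).
            try
            {
                // Steps 2 + 3: try to insert the index row, but the memtable is full,
                // so block until MemtableReclaimMemory frees space.
                memtableSpaceAvailable.await();
            }
            catch (InterruptedException ignored) {}
            // The readOrdering.close() that never happens, because the await above never returns.
            readOrderingClosed.countDown();
        }, "CompactionExecutor");

        Thread thread2 = new Thread(() -> {
            try
            {
                // Step 4: wait for every started readOrdering group to be closed ...
                readOrderingClosed.await();
            }
            catch (InterruptedException ignored) {}
            // Step 5: ... and only then free memtable space. Never reached.
            memtableSpaceAvailable.countDown();
        }, "MemtableReclaimMemory");

        thread1.start();
        thread2.start();
        thread1.join();   // hangs forever: each thread waits on the other
    }
}

Breaking either edge of the cycle (closing the read group before blocking on allocation, or letting the reclaim path proceed without waiting on in-flight reads) would avoid the hang in this simplified model.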
Here are the stack traces for Thread1 and Thread2 to prove the deadlock, as mentioned above.

Thread1 (*CompactionExecutor*)

"CompactionExecutor:5" - Thread t@118
   java.lang.Thread.State: WAITING
    at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
    at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
    at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
    at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131)
    at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235)
    at app//org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:195)
    at app//org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89)
    at app//org.apache.cassandra.utils.memory.MemtableBufferAllocator$1.allocate(MemtableBufferAllocator.java:40)
    at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:77)
    at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:63)
    at app//org.apache.cassandra.db.Clustering.clone(Clustering.java:54)
    at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:52)
    at app//org.apache.cassandra.db.rows.BTreeRow.clone(BTreeRow.java:517)
    at app//org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:375)
    at app//org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:351)
    at app//org.apache.cassandra.utils.btree.BTree.updateLeaves(BTree.java:461)
    at app//org.apache.cassandra.utils.btree.BTree.update(BTree.java:368)
    at app//org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDeltaInternal(AtomicBTreePartition.java:146)
    at app//org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:190)
    at app//org.apache.cassandra.db.memtable.SkipListMemtable.put(SkipListMemtable.java:134)
    at app//org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1431)
    at app//org.apache.cassandra.db.CassandraTableWriteHandler.write(CassandraTableWriteHandler.java:40)
    at app//org.apache.cassandra.index.internal.CassandraIndex.insert(CassandraIndex.java:529)
    at app//org.apache.cassandra.index.internal.CassandraIndex$1.indexCell(CassandraIndex.java:445)
    at app//org.apache.cassandra.index.internal.CassandraIndex$1.indexCells(CassandraIndex.java:437)
    at app//org.apache.cassandra.index.internal.CassandraIndex$1.insertRow(CassandraIndex.java:387)
    at app//org.apache.cassandra.index.SecondaryIndexManager.lambda$indexPartition$25(SecondaryIndexManager.java:985)
    at app//org.apache.cassandra.index.SecondaryIndexManager$$Lambda$1396/0x00007f00861e2cb0.accept(Unknown Source)
    at java.base@11.0.20.1/java.lang.Iterable.forEach(Iterable.java:75)
    at app//org.apache.cassandra.index.SecondaryIndexManager.indexPartition(SecondaryIndexManager.java:985)
    at app//org.apache.cassandra.index.internal.CollatedViewIndexBuilder.build(CollatedViewIndexBuilder.java:78)
    at app//org.apache.cassandra.db.compaction.CompactionManager$13.run(CompactionManager.java:1888)
    at app//org.apache.cassandra.concurrent.FutureTask$3.call(FutureTask.java:141)
    at app//org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
    at app//org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
    at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)

Thread2 (*MemtableReclaimMemory*)

"MemtableReclaimMemory:1" - Thread t@56
   java.lang.Thread.State: WAITING
    at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
    at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
    at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
    at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
    at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
    at app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267)
    at app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425)
    at app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357)
    at app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)

Jaydeep

On Wed, Nov 6, 2024 at 8:29 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

> We have also seen this issue a few times in our production (4.1). My
> teammates added a thread dump here
> <https://issues.apache.org/jira/browse/CASSANDRA-19564?focusedCommentId=17893987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17893987>
> .
>
> One of my theories is that the MemtableReclaimMemory thread has been
> unable to make progress:
>
> "MemtableReclaimMemory:1" - Thread t@56
>    java.lang.Thread.State: WAITING
>     at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
>     at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>     at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>     at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>     at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
>     at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
>     at app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267)
>     at app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425)
>     at app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357)
>     at app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>     at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)
>
> One of the possible bugs I could see is at this line
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadExecutionController.java#L141C77-L141C97>:
>
>     indexController = new ReadExecutionController(command,
>         indexCfs.readOrdering.start(), indexCfs.metadata(), null, null,
>         NO_SAMPLING, false);
>
> If *indexCfs.readOrdering.start()* succeeded but the constructor *new
> ReadExecutionController* then threw, we never close
> *indexCfs.readOrdering*, which means it will remain
> inaccurate forever.
>
> Jaydeep
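Regarding the possible leak quoted above: a minimal, self-contained sketch of the guard that would avoid it is below. The types here (ReadOrdering, Group, buildController) are hypothetical stand-ins, not Cassandra's actual classes; the point is only the pattern of closing a started group when the subsequent constructor throws.

import java.util.concurrent.atomic.AtomicInteger;

public class GuardedStartSketch
{
    /** Hypothetical stand-in for OpOrder: start() returns a group that must be closed. */
    static final class ReadOrdering
    {
        final AtomicInteger openGroups = new AtomicInteger();

        Group start()
        {
            openGroups.incrementAndGet();
            return new Group();
        }

        final class Group implements AutoCloseable
        {
            @Override
            public void close() { openGroups.decrementAndGet(); }
        }
    }

    /** Hypothetical stand-in for the ReadExecutionController constructor, which may throw. */
    static Object buildController(ReadOrdering.Group group, boolean fail)
    {
        if (fail)
            throw new IllegalStateException("constructor failed");
        return new Object();
    }

    /** The guarded pattern: if construction fails, close the group we just started. */
    static Object openController(ReadOrdering ordering, boolean fail)
    {
        ReadOrdering.Group group = ordering.start();
        try
        {
            return buildController(group, fail);
        }
        catch (Throwable t)
        {
            group.close();   // without this, the started group would stay open forever
            throw t;
        }
    }

    public static void main(String[] args)
    {
        ReadOrdering ordering = new ReadOrdering();
        try
        {
            openController(ordering, true);
        }
        catch (IllegalStateException expected)
        {
            // the construction failure is expected in this demo
        }
        System.out.println("open groups after failed construction: " + ordering.openGroups.get()); // prints 0
    }
}

This is the same shape as the try/catch in CassandraKeyspaceWriteHandler.beginWrite() quoted later in this thread, which closes the write group if anything after writeOrder.start() throws.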
> On Wed, Nov 6, 2024 at 5:51 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> I think I'm getting really close now. This seems to have something to do
>> with the "read-hotness-tracker:1" thread.
>>
>> The thread dump is:
>>
>> "read-hotness-tracker:1" daemon prio=5 tid=93 WAITING
>>     at jdk.internal.misc.Unsafe.park(Native Method)
>>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>>     at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>>     at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
>>        local variable: org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
>>     at org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
>>     at org.apache.cassandra.db.commitlog.AbstractCommitLogService.awaitSyncAt(AbstractCommitLogService.java:324)   <------ here
>>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>>        local variable: com.codahale.metrics.Timer$Context#2086795
>>     at org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:42)
>>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>>     at org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:284)
>>        local variable: org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
>>     at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:307)
>>        local variable: org.apache.cassandra.db.commitlog.CommitLog#1
>>        local variable: org.apache.cassandra.io.util.DataOutputBuffer$1$1#61
>>        local variable: org.apache.cassandra.db.commitlog.CommitLogSegment$Allocation#1
>>     at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.addToCommitLog(CassandraKeyspaceWriteHandler.java:100)
>>     at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:54)   <------ and here
>>        local variable: org.apache.cassandra.utils.concurrent.OpOrder$Group#8162
>>     at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:628)
>>        local variable: org.apache.cassandra.db.Keyspace#8
>>        local variable: org.apache.cassandra.db.Mutation#54491
>>     at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:512)
>>     at org.apache.cassandra.db.Mutation.apply(Mutation.java:228)
>>     at org.apache.cassandra.db.Mutation.apply(Mutation.java:248)
>>     at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:675)
>>     at org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:666)
>>     at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:447)
>>     at org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1488)
>>        local variable: java.lang.String#42806
>>        local variable: java.lang.String#47659
>>        local variable: org.apache.cassandra.io.sstable.UUIDBasedSSTableId#1389
>>        local variable: org.apache.cassandra.metrics.RestorableMeter#752
>>     at org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2170)
>>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>>        local variable: org.apache.cassandra.concurrent.ExecutionFailure$1#122029
>>        local variable: org.apache.cassandra.utils.WithResources$None$$Lambda$223#1
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>>        local variable: java.util.concurrent.Executors$RunnableAdapter#934
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
>>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>        local variable: org.apache.cassandra.concurrent.ScheduledThreadPoolExecutorPlus#6
>>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
>>        local variable: io.netty.util.concurrent.FastThreadLocalThread#7
>>        local variable: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>        local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
>>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>        local variable: io.netty.util.concurrent.FastThreadLocalRunnable#7
>>     at java.lang.Thread.run(Thread.java:829)
>>        local variable: io.netty.util.concurrent.FastThreadLocalThread#7
>>
>> In the CassandraKeyspaceWriteHandler.beginWrite() method:
>>
>>     public WriteContext beginWrite(Mutation mutation, boolean makeDurable) throws RequestExecutionException
>>     {
>>         OpOrder.Group group = null;
>>         try
>>         {
>>             group = Keyspace.writeOrder.start(); // <--- here
>>
>>             // write the mutation to the commitlog and memtables
>>             CommitLogPosition position = null;
>>             if (makeDurable)
>>             {
>>                 position = addToCommitLog(mutation); // <--- and here
>>             }
>>             return new CassandraWriteContext(group, position);
>>         }
>>         catch (Throwable t)
>>         {
>>             if (group != null)
>>             {
>>                 group.close();
>>             }
>>             throw t;
>>         }
>>     }
>>
>> The Keyspace.writeOrder.start() call is what blocks the MemtableFlushWriter
>> thread (its barrier cannot advance while the group stays open), and then
>> addToCommitLog(mutation) indirectly calls AbstractCommitLogService.awaitSyncAt():
>>
>>     void awaitSyncAt(long syncTime, Context context)
>>     {
>>         do
>>         {
>>             WaitQueue.Signal signal = context != null ? syncComplete.register(context, Context::stop) : syncComplete.register();
>>             if (lastSyncedAt < syncTime)
>>                 signal.awaitUninterruptibly(); // <--- here
>>             else
>>                 signal.cancel();
>>         }
>>         while (lastSyncedAt < syncTime);
>>     }
>>
>> This then got stuck on the signal.awaitUninterruptibly() call.
>>
>> Now I know what is blocking the memtable flushing, but I haven't been
>> able to figure out why it got stuck waiting for that signal.
>>
>> I would appreciate it if anyone can offer some insight here.
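To restate the chain in the quoted analysis: a write holds a Keyspace.writeOrder group open while it waits for the commit log sync signal, and the memtable flush cannot finish its barrier until every such group closes. Here is a minimal, self-contained sketch of that shape; the names are hypothetical stand-ins, not Cassandra's actual OpOrder or commit log classes.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Phaser;

public class FlushBlockedBySyncSketch
{
    // Stand-in for Keyspace.writeOrder: writers register a "group", and the flush
    // barrier waits for all currently open groups to close.
    static final Phaser writeOrder = new Phaser();
    // Stand-in for the commit log's syncComplete signal that awaitSyncAt() waits on.
    static final CountDownLatch commitLogSynced = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException
    {
        Thread writer = new Thread(() -> {
            writeOrder.register();                     // beginWrite(): writeOrder.start()
            try
            {
                // addToCommitLog() -> awaitSyncAt(): wait for the commit log sync signal
                commitLogSynced.await();
            }
            catch (InterruptedException ignored) {}
            finally
            {
                writeOrder.arriveAndDeregister();      // the group closes only after the sync
            }
        }, "read-hotness-tracker");

        Thread flusher = new Thread(() -> {
            // The flush barrier: wait for every group opened before the barrier to close.
            writeOrder.awaitAdvance(writeOrder.getPhase());
            System.out.println("flush completed");
        }, "MemtableFlushWriter");

        writer.start();
        Thread.sleep(100);          // let the writer open its group first
        flusher.start();

        flusher.join(1000);
        System.out.println("flusher still blocked after 1s: " + flusher.isAlive());

        // Only when the commit log sync finally signals does the chain unwind:
        commitLogSynced.countDown();
        flusher.join();
        System.out.println("flusher finished once the sync signal arrived");
    }
}

In this simplified model, the open question from the quoted email maps to: why does commitLogSynced (the real syncComplete signal from the commit log sync thread) never get counted down?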
>> On 05/11/2024 17:48, Bowen Song via user wrote:
>>
>> I will give it a try and see what I can find. I plan to go down the
>> rabbit hole tomorrow. Will keep you updated.
>>
>> On 05/11/2024 17:34, Jeff Jirsa wrote:
>>
>> On Nov 5, 2024, at 4:12 AM, Bowen Song via user
>> <user@cassandra.apache.org> wrote:
>>
>> Writes on this node start to time out and fail. But if left untouched,
>> it's only gonna get worse, and eventually lead to JVM OOM and crash.
>>
>> By inspecting the heap dump created at OOM, we can see that both of the
>> MemtableFlushWriter threads are stuck on line 1190
>> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
>> in ColumnFamilyStore.java:
>>
>>     // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
>>     // if they are stuck waiting on it, then wait for them all to complete
>>     writeBarrier.markBlocking();
>>     writeBarrier.await(); // <----------- stuck here
>>
>> And the MemtablePostFlush thread is stuck on line 1094
>> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
>> in the same file:
>>
>>     try
>>     {
>>         // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
>>         // on this task can rely on all prior flushes being complete
>>         latch.await(); // <----------- stuck here
>>     }
>>
>> Our top suspect is CDC interacting with repair, since this started to
>> happen shortly after we enabled CDC on the nodes, and each time repair was
>> running. But we have not been able to reproduce this in a testing cluster,
>> and we don't know what the next step is to troubleshoot this issue. So I'm
>> posting it on the mailing lists, hoping someone may know something about
>> it or point me in the right direction.
>>
>> Wouldn't be completely surprised if CDC or repair somehow has a
>> barrier; I've also seen similar behavior pre-3.0 with "very long running
>> read commands" that have a barrier on the memtable that prevents release.
>>
>> You've got the heap (great, way better than most people debugging); are
>> you able to navigate through it and look for references to that memtable or
>> other things holding a barrier?