Sorry, I mistook CassandraTableWriteHandler for CassandraKeyspaceWriteHandler and thought the code had changed. I was wrong: the code has not changed; they are two different files.

Your read-hotness-tracker thread dump is:

"read-hotness-tracker:1" - Thread t@113
java.lang.Thread.State: WAITING
at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131)
at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235)
at app//org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:195)
at app//org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89)
at app//org.apache.cassandra.utils.memory.MemtableBufferAllocator$1.allocate(MemtableBufferAllocator.java:40)
at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:77)
at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:63)
at app//org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:46)
at app//org.apache.cassandra.db.memtable.SkipListMemtable.put(SkipListMemtable.java:119)
at app//org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1431)
at app//org.apache.cassandra.db.CassandraTableWriteHandler.write(CassandraTableWriteHandler.java:40)
at app//org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:783)
at app//org.apache.cassandra.db.Keyspace.apply(Keyspace.java:616)
at app//org.apache.cassandra.db.Mutation.apply(Mutation.java:256)
at app//org.apache.cassandra.db.Mutation.apply(Mutation.java:276)
at app//org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:696)
at app//org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:687)
at app//org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:488)
at app//org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1509)
at app//org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2176)
at app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
at java.base@11.0.20.1/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base@11.0.20.1/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base@11.0.20.1/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)
Locked ownable synchronizers:
- Locked <76304b81> (a java.util.concurrent.ThreadPoolExecutor$Worker)

It went from Keyspace.applyInternal() into CassandraTableWriteHandler.write(), that is, after CassandraKeyspaceWriteHandler.beginWrite() had returned successfully but before the barrier had been closed. This is unlike our situation, in which CassandraKeyspaceWriteHandler.beginWrite() itself got stuck waiting for a signal.

Based on my understanding, these two issues have different causes.
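
To make the barrier/group relationship concrete, here is a minimal, self-contained sketch of the OpOrder contract both dumps hinge on. It is a toy illustration, not how Cassandra wires these pieces up; the method names (start()/close() for writers, newBarrier()/issue()/await() for flushes) are from the 4.1 OpOrder class as I remember it, so treat the details as an assumption. It assumes cassandra-all on the classpath:

    import org.apache.cassandra.utils.concurrent.OpOrder;

    public final class OpOrderBarrierSketch
    {
        public static void main(String[] args) throws Exception
        {
            OpOrder order = new OpOrder();

            // "writer" side: what CassandraKeyspaceWriteHandler.beginWrite() does via Keyspace.writeOrder.start()
            OpOrder.Group group = order.start();

            // "flusher" side: what the flush/reclaim path does before it can release memtable memory
            OpOrder.Barrier barrier = order.newBarrier();
            barrier.issue();

            Thread flusher = new Thread(barrier::await, "fake-flusher");
            flusher.start();

            Thread.sleep(500);
            // the barrier cannot complete while a pre-barrier group is still open
            System.out.println("flusher still waiting: " + flusher.isAlive());

            group.close();           // closing the group is the only thing that releases the barrier
            flusher.join(5000);
            System.out.println("flusher still waiting: " + flusher.isAlive());
        }
    }

So whether the stuck write is parked inside beginWrite() (waiting for the commit log sync signal, as in our dump) or inside write() (waiting for memtable memory, as in yours), the effect on the flush/reclaim barrier is the same: the group never closes, so await() never returns.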


On 06/11/2024 18:36, Bowen Song via user wrote:

I can see some similarities and some differences between your thread dump and ours.

In your thread dump:

  * no MemtableFlushWriter thread
  * the MemtablePostFlush thread is idle
  * the MemtableReclaimMemory thread is waiting for a barrier,
    possibly stuck
  * the read-hotness-tracker thread is waiting for memtable memory
    allocation. I'm not sure whether it's blocking a barrier; the
    relevant part of the source code has changed between 4.1.1 and 4.1.3.

In our thread dump:

  * two MemtableFlushWriter threads are stuck, each waiting for a barrier
  * the MemtablePostFlush thread is stuck, waiting for a latch
  * the MemtableReclaimMemory thread is idle
  * the read-hotness-tracker thread is waiting for a signal, and it's
    blocking the barrier one of the MemtableFlushWriter threads is
    waiting on

It is possible that in both cases the read-hotness-tracker thread was blocking the barrier, which caused the MemtableFlushWriter or MemtableReclaimMemory thread to get stuck. It is also totally possible that your case has nothing to do with the read-hotness-tracker thread.

I would recommend you try to find out what was blocking the barrier the MemtableReclaimMemory thread is waiting on.


On 06/11/2024 16:29, Jaydeep Chovatia wrote:
We have also seen this issue a few times in our production (4.1). My teammates added a thread dump here <https://issues.apache.org/jira/browse/CASSANDRA-19564?focusedCommentId=17893987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17893987>.

One of my theories is that the MemtableReclaimMemory thread has been unable to make progress:

"MemtableReclaimMemory:1" - Thread t@56

java.lang.Thread.State: WAITING

at java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native <http://java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native> Method)

at java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) <http://java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)>

at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)

at app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)

at app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)

at app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)

at app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267)

at app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425)

at app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357)

at app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

at app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)

at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) <http://java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)>

at java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) <http://java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)>

at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

at java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829) <http://java.base@11.0.20.1/java.lang.Thread.run(Thread.java:829)>


One of the possible bugs I could see is this line <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadExecutionController.java#L141C77-L141C97>:

    indexController = new ReadExecutionController(command, indexCfs.readOrdering.start(), indexCfs.metadata(), null, null, NO_SAMPLING, false);

If "indexCfs.readOrdering.start()" succeeded but the constructor "new ReadExecutionController(...)" threw an exception, then we are not closing the group taken on "indexCfs.readOrdering", which means its state will remain inaccurate forever.
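
If that theory is correct, the defensive pattern would look something like the sketch below. This is only my illustration of the idea, not an actual Cassandra patch, and it reuses the argument names from the quoted line:

    // Illustration only (not an actual Cassandra patch): take the group first,
    // and close it if constructing the controller fails, so a throwing
    // constructor can no longer leak an open group on indexCfs.readOrdering.
    OpOrder.Group indexReadGroup = indexCfs.readOrdering.start();
    try
    {
        indexController = new ReadExecutionController(command, indexReadGroup, indexCfs.metadata(),
                                                      null, null, NO_SAMPLING, false);
    }
    catch (Throwable t)
    {
        indexReadGroup.close();   // otherwise the group stays open and blocks future barriers
        throw t;
    }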


Jaydeep

On Wed, Nov 6, 2024 at 5:51 AM Bowen Song via user <user@cassandra.apache.org> wrote:

    I think I'm getting really close now. This seems to have
    something to do with the "read-hotness-tracker:1" thread. The
    thread dump is:

    "read-hotness-tracker:1" daemon prio=5 tid=93 WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        at
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
           local variable:
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
        at
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
           local variable:
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
        at
    
org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
           local variable:
    
org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
        at
    
org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
        at
    
org.apache.cassandra.db.commitlog.AbstractCommitLogService.awaitSyncAt(AbstractCommitLogService.java:324)
     
    <------ here
           local variable:
    org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
           local variable: com.codahale.metrics.Timer$Context#2086795
        at
    
org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:42)
           local variable:
    org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
        at
    
org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:284)
           local variable:
    org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
        at
    org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:307)
           local variable: org.apache.cassandra.db.commitlog.CommitLog#1
           local variable:
    org.apache.cassandra.io.util.DataOutputBuffer$1$1#61
           local variable:
    org.apache.cassandra.db.commitlog.CommitLogSegment$Allocation#1
        at
    
org.apache.cassandra.db.CassandraKeyspaceWriteHandler.addToCommitLog(CassandraKeyspaceWriteHandler.java:100)
        at
    
org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:54)
    <------ and here
           local variable:
    org.apache.cassandra.utils.concurrent.OpOrder$Group#8162
        at
    org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:628)
           local variable: org.apache.cassandra.db.Keyspace#8
           local variable: org.apache.cassandra.db.Mutation#54491
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:512)
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:228)
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:248)
        at
    
org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:675)
        at
    
org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:666)
        at
    
org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:447)
        at
    
org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1488)
           local variable: java.lang.String#42806
           local variable: java.lang.String#47659
           local variable:
    org.apache.cassandra.io.sstable.UUIDBasedSSTableId#1389
           local variable:
    org.apache.cassandra.metrics.RestorableMeter#752
        at
    
org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2170)
        at
    
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
           local variable:
    org.apache.cassandra.concurrent.ExecutionFailure$1#122029
           local variable:
    org.apache.cassandra.utils.WithResources$None$$Lambda$223#1
        at
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
           local variable:
    java.util.concurrent.Executors$RunnableAdapter#934
        at
    java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
           local variable:
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
        at
    
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
           local variable:
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
        at
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
           local variable:
    org.apache.cassandra.concurrent.ScheduledThreadPoolExecutorPlus#6
           local variable:
    java.util.concurrent.ThreadPoolExecutor$Worker#7
           local variable:
    io.netty.util.concurrent.FastThreadLocalThread#7
           local variable:
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
        at
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
           local variable:
    java.util.concurrent.ThreadPoolExecutor$Worker#7
        at
    
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           local variable:
    io.netty.util.concurrent.FastThreadLocalRunnable#7
        at java.lang.Thread.run(Thread.java:829)
           local variable:
    io.netty.util.concurrent.FastThreadLocalThread#7


    In the CassandraKeyspaceWriteHandler.beginWrite() method:

    public WriteContext beginWrite(Mutation mutation, boolean makeDurable) throws RequestExecutionException
    {
        OpOrder.Group group = null;
        try
        {
            group = Keyspace.writeOrder.start();  // <--- here

            // write the mutation to the commitlog and memtables
            CommitLogPosition position = null;
            if (makeDurable)
            {
                position = addToCommitLog(mutation);  // <--- and here
            }
            return new CassandraWriteContext(group, position);
        }
        catch (Throwable t)
        {
            if (group != null)
            {
                group.close();
            }
            throw t;
        }
    }
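
    For context, the group taken here is only released when the returned
    WriteContext is closed by the caller (Keyspace.applyInternal() does this
    once the write completes, as far as I can tell). Paraphrasing
    CassandraWriteContext from memory, so the details may differ from the
    4.1 source:

    public class CassandraWriteContext implements WriteContext
    {
        private final OpOrder.Group group;
        private final CommitLogPosition position;

        public CassandraWriteContext(OpOrder.Group group, CommitLogPosition position)
        {
            this.group = group;
            this.position = position;
        }

        @Override
        public void close()
        {
            // until this runs, any flush barrier issued after the group
            // was started cannot finish its await()
            group.close();
        }
    }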

    The Keyspace.writeOrder.start() call opens the group that is
    blocking the MemtableFlushWriter thread's barrier, and then
    beginWrite() calls addToCommitLog(mutation), which indirectly
    calls AbstractCommitLogService.awaitSyncAt():

        void awaitSyncAt(long syncTime, Context context)
        {
            do
            {
                WaitQueue.Signal signal = context != null ? syncComplete.register(context, Context::stop) : syncComplete.register();
                if (lastSyncedAt < syncTime)
                    signal.awaitUninterruptibly();  // <--- here
                else
                    signal.cancel();
            }
            while (lastSyncedAt < syncTime);
        }

    This then got stuck on signal.awaitUninterruptibly().

    Now I know what is blocking the memtable flushing, but what I
    haven't been able to figure out is why it got stuck waiting for
    that signal.
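
    For anyone looking at this, the signal is expected to come from the
    commit log's sync thread: if I read the code right, AbstractCommitLogService's
    sync loop advances lastSyncedAt and then calls syncComplete.signalAll().
    The toy sketch below models just that handshake (assuming the WaitQueue
    API seen in the dump and cassandra-all on the classpath; it is not the
    real commit log service), to show that if the sync side never runs, the
    waiter parks forever:

    import org.apache.cassandra.utils.concurrent.WaitQueue;

    public final class SyncSignalSketch
    {
        static final WaitQueue syncComplete = WaitQueue.newWaitQueue();
        static volatile long lastSyncedAt = 0;

        // waiter side, shaped like AbstractCommitLogService.awaitSyncAt()
        static void awaitSyncAt(long syncTime)
        {
            do
            {
                WaitQueue.Signal signal = syncComplete.register();
                if (lastSyncedAt < syncTime)
                    signal.awaitUninterruptibly();   // parks forever if nobody ever signals
                else
                    signal.cancel();
            }
            while (lastSyncedAt < syncTime);
        }

        // syncer side, shaped like the commit log sync loop
        static void markSynced(long syncedAt)
        {
            lastSyncedAt = syncedAt;     // publish the new sync time before signalling
            syncComplete.signalAll();    // wake everyone parked in awaitSyncAt()
        }

        public static void main(String[] args) throws InterruptedException
        {
            Thread waiter = new Thread(() -> awaitSyncAt(100), "fake-read-hotness-tracker");
            waiter.start();
            Thread.sleep(500);
            markSynced(100);             // comment this out and the waiter never returns
            waiter.join();
            System.out.println("waiter released");
        }
    }

    So the next thing to look at is probably what the commit log sync thread
    (PeriodicCommitLogService in this dump) was doing at the time.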

    I would appreciate it if anyone can offer some insight here.


    On 05/11/2024 17:48, Bowen Song via user wrote:

    I will give it a try and see what I can find. I plan to go down
    the rabbit hole tomorrow. Will keep you updated.

    On 05/11/2024 17:34, Jeff Jirsa wrote:


    On Nov 5, 2024, at 4:12 AM, Bowen Song via user
    <user@cassandra.apache.org> wrote:

    Writes on this node start to time out and fail. But if left
    untouched, it only gets worse, eventually leading to JVM OOM
    and a crash.

    By inspecting the heap dump created at OOM, we can see that
    both of the MemtableFlushWriter threads are stuck on line 1190
    <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
    in ColumnFamilyStore.java:

                // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
                // if they are stuck waiting on it, then wait for them all to complete
                writeBarrier.markBlocking();
                writeBarrier.await();   // <----------- stuck here

    And the MemtablePostFlush thread is stuck on line 1094
    <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
    in the same file.

                try
                {
                    // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
                    // on this task can rely on all prior flushes being complete
                    latch.await();   // <----------- stuck here
                }
    Our top suspect is CDC interacting with repair, since this
    started to happen shortly after we enabled CDC on the nodes,
    and each time it happened a repair was running. But we have not
    been able to reproduce this in a testing cluster, and we don't
    know what the next step to troubleshoot this issue should be.
    So I'm posting it to the mailing lists, hoping someone may know
    something about it or can point me in the right direction.


    I wouldn’t be completely surprised if CDC or repair somehow has
    a barrier. I’ve also seen similar behavior pre-3.0 with “very
    long running read commands” that have a barrier on the memtable
    that prevents release.

    You’ve got the heap (great, way better than most people
    debugging). Are you able to navigate through it and look for
    references to that memtable or other things holding a barrier?


