Hieu Ngo created CASSANDRA-20607:
------------------------------------

             Summary: Gossip thread slows down when forceBlockingFlush is 
called on system.peers and system.peers_v2 tables
                 Key: CASSANDRA-20607
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20607
             Project: Apache Cassandra
          Issue Type: Bug
            Reporter: Hieu Ngo


The metrics system at Dropbox.com (Vortex2) uses Cassandra as the storage layer 
for metrics and metadata. The system is being migrated from Cassandra hosted 
on premise to an in-house Kubernetes-based deployment system.

Whenever a pod running Cassandra is rotated out during a deployment, we have 
seen many nodes marked Down Normal almost simultaneously for a brief period 
(15-30s), resulting in excessive hint creation, timeouts, and unavailable 
exceptions.

It turns out that the bug is in the forceBlockingFlush call on the system.peers 
and system.peers_v2 tables, which happens on the single-threaded Gossip Stage. 
The flush takes too long, occupying the Gossip Stage and causing gossip tasks 
to pile up. Since gossip tasks aren't processed promptly, nodes in the cluster 
start marking other nodes as Down Normal. From the stack trace, we believe the 
forceBlockingFlush is called as part of the host IP update when the rotated pod 
changes its IP.

We have created a patch internally that changes the flush to a non-blocking 
flush when the flag cassandra.load_ring_state=false is set. The rationale for 
the fix is that since the ring state isn't loaded from those peers tables 
during startup, there is no need to wait for the flush. However, we haven't 
figured out why the blocking flush to the system.peers and system.peers_v2 
tables takes so long.

The asks for this bug are:
1. Figure out why the blocking flush to the system.peers and system.peers_v2 
tables takes so long.

2. Create a fix to change the flush to non-blocking if 
cassandra.load_ring_state is set to false. We note that 
cassandra.load_ring_state was removed in Cassandra 5.0, so the fix may only be 
applicable to 4.x and earlier versions.
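To illustrate the gating logic of our internal patch, here is a minimal 
standalone sketch. The class and method names below are ours, chosen for 
illustration only; the real change would live in Cassandra's 
SystemKeyspace.removeEndpoint / forceBlockingFlush path:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of the proposed gating: when the node is started
// with -Dcassandra.load_ring_state=false, the peers-table flush is submitted
// asynchronously instead of blocking the calling (Gossip Stage) thread.
public class PeersFlushGate {
    // Mirrors the JVM flag the patch keys on; defaults to true as in 4.x.
    static boolean loadRingState() {
        return Boolean.parseBoolean(
            System.getProperty("cassandra.load_ring_state", "true"));
    }

    // Runs the flush task and returns its future. Callers block only when
    // ring state must be readable from system.peers at the next startup.
    static CompletableFuture<Void> flushPeersTables(Runnable flush) {
        CompletableFuture<Void> f = CompletableFuture.runAsync(flush);
        if (loadRingState())
            f.join(); // old behavior: forceBlockingFlush semantics
        return f;     // new behavior: caller continues immediately
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        System.setProperty("cassandra.load_ring_state", "false");
        long start = System.nanoTime();
        // Simulate a slow memtable flush; the call returns without waiting.
        CompletableFuture<Void> f = flushPeersTables(() -> sleep(200));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("returned after ~" + elapsedMs + " ms (non-blocking)");
        f.join(); // the flush still completes in the background
    }
}
```

With the flag unset (or true), flushPeersTables joins the future before 
returning, preserving today's blocking semantics; with 
cassandra.load_ring_state=false it returns immediately, so a slow flush no 
longer stalls the Gossip Stage.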

 

Here is the stack trace of GossipStage:

"GossipStage:1" #205 daemon prio=5 os_prio=0 cpu=113114.96ms elapsed=233376.38s tid=0x00007fad2002c480 nid=0x3a5 waiting on condition  [0x00007faf9613d000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.25/Native Method)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.25/LockSupport.java:323)
        at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
        at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
        at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
        at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
        at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:46)
        at org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:245)
        at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:501)
        at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:477)
        at org.apache.cassandra.db.SystemKeyspace.forceBlockingFlush(SystemKeyspace.java:947)
        at org.apache.cassandra.db.SystemKeyspace.removeEndpoint(SystemKeyspace.java:916)
        - locked <0x0000000101dfccc0> (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace)
        at org.apache.cassandra.service.StorageService.removeEndpoint(StorageService.java:3339)
        at org.apache.cassandra.service.StorageService.updateTokenMetadata(StorageService.java:3052)
        at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3176)
        at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2638)
        at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1711)
        at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3652)
        at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1481)
        at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1627)
        at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
        at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
        at org.apache.cassandra.net.InboundSink$$Lambda$603/0x000000080069c040.accept(Unknown Source)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
        at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
        at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.25/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.25/ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.25/Thread.java:829)

Here is the PerDiskMemtableFlushWriter that the GossipStage was waiting for:

"PerDiskMemtableFlushWriter_0:470" #170632 daemon prio=5 os_prio=0 cpu=42316.16ms elapsed=44.53s tid=0x00007faee40c91c0 nid=0x79f5 runnable [0x00007faf70613000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.cassandra.db.marshal.AbstractType.writeValue(AbstractType.java:469)
        at org.apache.cassandra.db.ClusteringPrefix$Serializer.serializeValuesWithoutSize(ClusteringPrefix.java:397)
        at org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:133)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:188)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
        at org.apache.cassandra.db.ColumnIndex.add(ColumnIndex.java:258)
        at org.apache.cassandra.db.ColumnIndex.buildRowIndex(ColumnIndex.java:122)
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:222)
        at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
        at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.writeSortedContents(Flushing.java:163)
        at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.call(Flushing.java:183)
        at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.call(Flushing.java:122)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.25/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.25/ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.25/Thread.java:829)


