[ https://issues.apache.org/jira/browse/CASSANDRA-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hieu Ngo updated CASSANDRA-20607:
---------------------------------
Description:

The metrics system at Dropbox.com (Vortex2) uses Cassandra as the data storage layer to store metrics and metadata. The system is undergoing a migration from Cassandra hosted on premise to an in-house Kubernetes-based deployment system.

Whenever a pod running Cassandra is rotated out during a deployment, we have seen many nodes marked Down Normal at roughly the same time for a brief period (15-30s), resulting in excessive hint creation, timeouts, and unavailable exceptions.

It turns out that the bug lies in the forceBlockingFlush call on the system.peers and system.peers_v2 tables, which happens on the single-threaded Gossip Stage. The flush takes too long, occupying the Gossip Stage and causing gossip tasks to pile up. Since gossip tasks are not sent promptly, nodes in the cluster start marking other nodes as Down Normal. From the stack trace, we believe forceBlockingFlush is called as part of the host IP update when the rotated pod changes its IP.

We have created a patch internally to change the flush to a non-blocking flush when the flag cassandra.load_ring_state=false is set. The rationale for the fix is that since the ring state is not loaded from those peers tables during startup, there is no need to wait for the flush. However, we have not figured out why the blocking flush to the system.peers and system.peers_v2 tables takes so long.

The asks for this bug are:
1. Figure out why the blocking flush to the system.peers and system.peers_v2 tables takes so long.
2. Create a fix to change the flush to non-blocking when cassandra.load_ring_state is set to false, along the lines of the sketch below. Note that cassandra.load_ring_state was removed in Cassandra 5.0, so the fix may only be applicable to 4.x and older versions.
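For reference, a minimal sketch of the kind of change described above. This is an approximation, not the actual internal patch: removeEndpoint and forceBlockingFlush follow the SystemKeyspace path visible in the GossipStage stack trace below, while deleteFromPeersTables and flushPeersAsync are hypothetical stand-ins for the existing DELETE statements and a non-blocking flush helper, and the exact signatures differ by version.

{code:java}
// Sketch only: gate the peers-table flush on the cassandra.load_ring_state property.
// deleteFromPeersTables and flushPeersAsync are hypothetical helpers used for illustration.
public static synchronized void removeEndpoint(InetAddressAndPort ep)
{
    // Remove the departed endpoint's rows from system.peers and system.peers_v2.
    deleteFromPeersTables(ep);

    if (Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
    {
        // Default behaviour: ring state is rebuilt from these tables at startup,
        // so wait until the deletion has been flushed to disk.
        forceBlockingFlush("peers", "peers_v2");
    }
    else
    {
        // With -Dcassandra.load_ring_state=false the tables are never read back,
        // so submit the flush without blocking the single-threaded Gossip Stage.
        flushPeersAsync("peers", "peers_v2");
    }
}
{code}

The flag itself is the existing startup property: nodes are started with -Dcassandra.load_ring_state=false, which already tells a 4.x node not to load ring state from the system tables at startup.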
Here is the stack trace of GossipStage:

"GossipStage:1" #205 daemon prio=5 os_prio=0 cpu=113114.96ms elapsed=233376.38s tid=0x00007fad2002c480 nid=0x3a5 waiting on condition [0x00007faf9613d000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.25/Native Method)
    at java.util.concurrent.locks.LockSupport.park(java.base@11.0.25/LockSupport.java:323)
    at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
    at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
    at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
    at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
    at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:46)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:245)
    at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:501)
    at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:477)
    at org.apache.cassandra.db.SystemKeyspace.forceBlockingFlush(SystemKeyspace.java:947)
    at org.apache.cassandra.db.SystemKeyspace.removeEndpoint(SystemKeyspace.java:916)
    - locked <0x0000000101dfccc0> (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace)
    at org.apache.cassandra.service.StorageService.removeEndpoint(StorageService.java:3339)
    at org.apache.cassandra.service.StorageService.updateTokenMetadata(StorageService.java:3052)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3176)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2638)
    at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1711)
    at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3652)
    at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1481)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1627)
    at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
    at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
    at org.apache.cassandra.net.InboundSink$$Lambda$603/0x000000080069c040.accept(Unknown Source)
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
    at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
    at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.25/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.25/ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(java.base@11.0.25/Thread.java:829)

Here is the PerDiskMemtableFlushWriter that the GossipStage was waiting for:

"PerDiskMemtableFlushWriter_0:470" #170632 daemon prio=5 os_prio=0 cpu=42316.16ms elapsed=44.53s tid=0x00007faee40c91c0 nid=0x79f5 runnable [0x00007faf70613000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.cassandra.db.marshal.AbstractType.writeValue(AbstractType.java:469)
    at org.apache.cassandra.db.ClusteringPrefix$Serializer.serializeValuesWithoutSize(ClusteringPrefix.java:397)
    at org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:133)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:188)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
    at org.apache.cassandra.db.ColumnIndex.add(ColumnIndex.java:258)
    at org.apache.cassandra.db.ColumnIndex.buildRowIndex(ColumnIndex.java:122)
    at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:222)
    at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
    at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.writeSortedContents(Flushing.java:163)
    at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.call(Flushing.java:183)
    at org.apache.cassandra.db.memtable.Flushing$FlushRunnable.call(Flushing.java:122)
    at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
    at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.25/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.25/ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(java.base@11.0.25/Thread.java:829)
> Gossip thread slows down when forceBlockingFlush is called on system.peers and system.peers_v2 tables
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20607
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20607
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Hieu Ngo
>            Priority: Normal
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org