Dan,

As part of the upgrade, did you upgrade the sstables?

Sent from mobile. Please excuse typos.
On 28 Sep 2017 17:45, "Dan Kinder" <dkin...@turnitin.com> wrote:

> I should also note, I also see nodes become locked up without seeing that
> Exception. But the GossipStage buildup does seem correlated with gossip
> activity, e.g. me restarting a different node.
>
> On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
>>     at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
>>     at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepairStage would err.
>>
>> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
>> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
>> If I can't find a resolution I'm going to need to downgrade and restore to
>> backup...
>>
>> The only issue I found that looked similar is
>> https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
>> be fixed by 3.10.
>>
>> $ nodetool tpstats
>>
>> Pool Name                      Active   Pending   Completed   Blocked   All time blocked
>> ReadStage                           0         0      582103         0                  0
>> MiscStage                           0         0           0         0                  0
>> CompactionExecutor                 11        11        2868         0                  0
>> MutationStage                      32   4593678    55057393         0                  0
>> GossipStage                         1      2818      371487         0                  0
>> RequestResponseStage                0         0     4345522         0                  0
>> ReadRepairStage                     0         0      151473         0                  0
>> CounterMutationStage                0         0           0         0                  0
>> MemtableFlushWriter                 1        81          76         0                  0
>> MemtablePostFlush                   1       382         139         0                  0
>> ValidationExecutor                  0         0           0         0                  0
>> ViewMutationStage                   0         0           0         0                  0
>> CacheCleanupExecutor                0         0           0         0                  0
>> PerDiskMemtableFlushWriter_10       0         0          69         0                  0
>> PerDiskMemtableFlushWriter_11       0         0          69         0                  0
>> MemtableReclaimMemory               0         0          81         0                  0
>> PendingRangeCalculator              0         0          32         0                  0
>> SecondaryIndexManagement            0         0           0         0                  0
>> HintsDispatcher                     0         0         596         0                  0
>> PerDiskMemtableFlushWriter_1        0         0          69         0                  0
>> Native-Transport-Requests          11         0     4547746         0                 67
>> PerDiskMemtableFlushWriter_2        0         0          69         0                  0
>> MigrationStage                      1      1545         586         0                  0
>> PerDiskMemtableFlushWriter_0        0         0          80         0                  0
>> Sampler                             0         0           0         0                  0
>> PerDiskMemtableFlushWriter_5        0         0          69         0                  0
>> InternalResponseStage               0         0       45432         0                  0
>> PerDiskMemtableFlushWriter_6        0         0          69         0                  0
>> PerDiskMemtableFlushWriter_3        0         0          69         0                  0
>> PerDiskMemtableFlushWriter_4        0         0          69         0                  0
>> PerDiskMemtableFlushWriter_9        0         0          69         0                  0
>> AntiEntropyStage                    0         0           0         0                  0
>> PerDiskMemtableFlushWriter_7        0         0          69         0                  0
>> PerDiskMemtableFlushWriter_8        0         0          69         0                  0
>>
>> Message type           Dropped
>> READ                         0
>> RANGE_SLICE                  0
>> _TRACE                       0
>> HINT                         0
>> MUTATION                     0
>> COUNTER_MUTATION             0
>> BATCH_STORE                  0
>> BATCH_REMOVE                 0
>> REQUEST_RESPONSE             0
>> PAGED_RANGE                  0
>> READ_REPAIR                  0
>>
>> -dan
>
> --
> Dan Kinder
> Principal Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
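
For anyone trying to catch the same pile-up on their own nodes, the Pending column of the `nodetool tpstats` output quoted in this thread can be checked mechanically. The following is only a rough sketch, not anything from the Cassandra tooling: the function name `backed_up_pools`, the threshold of 1000 pending tasks, and the assumption that each pool row is a name followed by five integer columns (as in the output in this thread) are all illustrative choices.

```python
# Hypothetical helper: flag thread pools whose Pending count looks backed up.
# Assumes each pool row is "<name> <active> <pending> <completed> <blocked> <all-time-blocked>",
# matching the tpstats output quoted in this thread; other versions may differ.

SAMPLE = """\
Pool Name                      Active   Pending   Completed   Blocked   All time blocked
ReadStage                           0         0      582103         0                  0
MutationStage                      32   4593678    55057393         0                  0
GossipStage                         1      2818      371487         0                  0
MemtablePostFlush                   1       382         139         0                  0
MigrationStage                      1      1545         586         0                  0

Message type           Dropped
READ                         0
MUTATION                     0
"""


def backed_up_pools(tpstats_text, pending_threshold=1000):
    """Return {pool_name: pending} for pools at or above the threshold.

    The threshold is an arbitrary illustrative value, not a Cassandra default.
    """
    flagged = {}
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Keep only pool rows: one name plus five integer columns.
        # This skips the header, blank lines, and the dropped-message section.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        name, pending = fields[0], int(fields[2])
        if pending >= pending_threshold:
            flagged[name] = pending
    return flagged


if __name__ == "__main__":
    for name, pending in backed_up_pools(SAMPLE).items():
        print(f"{name}: {pending} pending")
```

In practice the text would come from running `nodetool tpstats` on each node at intervals; comparing successive snapshots shows whether Pending is draining or only growing, which is the symptom described above.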