Dan,
As part of upgrade, did you upgrade the sstables?

Sent from mobile. Please excuse typos

On 28 Sep 2017 17:45, "Dan Kinder" <dkin...@turnitin.com> wrote:

> I should also note, I also see nodes become locked up without seeing that
> Exception. But the GossipStage buildup does seem correlated with gossip
> activity, e.g. me restarting a different node.
>
> On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
>> CassandraDaemon.java:228 - Exception in thread
>> Thread[ReadRepairStage:2328,5,main]
>>
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>> out - received only 1 responses.
>>
>>         at 
>> org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterat
>> ors$2.close(UnfilteredPartitionIterators.java:182)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at 
>> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at 
>> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThr
>> ow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at 
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> ~[na:1.8.0_91]
>>
>>         at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> ~[na:1.8.0_91]
>>
>>         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$
>> threadLocalDeallocator$0(NamedThreadFactory.java:81)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepairStage would err.
>>
>> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
>> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
>> If I can't find a resolution I'm going to need to downgrade and restore to
>> backup...
>>
>> The only issue I found that looked similar is https://issues.apache.org/j
>> ira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>>
>>
>> $ nodetool tpstats
>>
>> Pool Name                         Active   Pending      Completed
>> Blocked  All time blocked
>>
>> ReadStage                              0         0         582103
>>   0                 0
>>
>> MiscStage                              0         0              0
>>   0                 0
>>
>> CompactionExecutor                    11        11           2868
>>   0                 0
>>
>> MutationStage                         32   4593678       55057393
>>   0                 0
>>
>> GossipStage                            1      2818         371487
>>   0                 0
>>
>> RequestResponseStage                   0         0        4345522
>>   0                 0
>>
>> ReadRepairStage                        0         0         151473
>>   0                 0
>>
>> CounterMutationStage                   0         0              0
>>   0                 0
>>
>> MemtableFlushWriter                    1        81             76
>>   0                 0
>>
>> MemtablePostFlush                      1       382            139
>>   0                 0
>>
>> ValidationExecutor                     0         0              0
>>   0                 0
>>
>> ViewMutationStage                      0         0              0
>>   0                 0
>>
>> CacheCleanupExecutor                   0         0              0
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_10          0         0             69
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_11          0         0             69
>>   0                 0
>>
>> MemtableReclaimMemory                  0         0             81
>>   0                 0
>>
>> PendingRangeCalculator                 0         0             32
>>   0                 0
>>
>> SecondaryIndexManagement               0         0              0
>>   0                 0
>>
>> HintsDispatcher                        0         0            596
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_1           0         0             69
>>   0                 0
>>
>> Native-Transport-Requests             11         0        4547746
>>   0                67
>>
>> PerDiskMemtableFlushWriter_2           0         0             69
>>   0                 0
>>
>> MigrationStage                         1      1545            586
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_0           0         0             80
>>   0                 0
>>
>> Sampler                                0         0              0
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_5           0         0             69
>>   0                 0
>>
>> InternalResponseStage                  0         0          45432
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_6           0         0             69
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_3           0         0             69
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_4           0         0             69
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_9           0         0             69
>>   0                 0
>>
>> AntiEntropyStage                       0         0              0
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_7           0         0             69
>>   0                 0
>>
>> PerDiskMemtableFlushWriter_8           0         0             69
>>   0                 0
>>
>>
>> Message type           Dropped
>>
>> READ                         0
>>
>> RANGE_SLICE                  0
>>
>> _TRACE                       0
>>
>> HINT                         0
>>
>> MUTATION                     0
>>
>> COUNTER_MUTATION             0
>>
>> BATCH_STORE                  0
>>
>> BATCH_REMOVE                 0
>>
>> REQUEST_RESPONSE             0
>>
>> PAGED_RANGE                  0
>>
>> READ_REPAIR                  0
>>
>>
>> -dan
>>
>
>
>
> --
> Dan Kinder
> Principal Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>

Reply via email to