Re:

Jeff Jirsa Thu, 28 Sep 2017 14:14:36 -0700

The digest mismatch exception is not a problem; that's why it's only logged
at debug.


As Thomas noted, there's a pretty good chance this is
https://issues.apache.org/jira/browse/CASSANDRA-13754 - if you see a lot of
GCInspector logs indicating GC pauses, that would add confidence to that
diagnosis.
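
One rough way to check for those pauses is to scan system.log for GCInspector entries and see how many are long and how long the worst one is. The sketch below is only an illustration: the line shape it matches ("<collector> GC in <N>ms"), the 200 ms threshold, and the helper names gc_pauses and summarize are assumptions, not something taken from this thread, so adjust them to what your logs actually contain.

```python
import re
import sys

# Assumed GCInspector line shape (an assumption, not quoted from this thread), e.g.:
#   INFO  [Service Thread] <timestamp> GCInspector.java:284 - G1 Young Generation GC in 312ms.  ...
GC_LINE = re.compile(r"GCInspector\.java:\d+ - (?P<collector>.+?) GC in (?P<ms>\d+)ms")


def gc_pauses(lines, min_ms=200):
    """Return (collector, pause_ms) for each GCInspector line at or above min_ms."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match and int(match.group("ms")) >= min_ms:
            pauses.append((match.group("collector"), int(match.group("ms"))))
    return pauses


def summarize(pauses):
    """Return (number_of_pauses, longest_pause_ms); (0, 0) when there are none."""
    if not pauses:
        return 0, 0
    return len(pauses), max(ms for _, ms in pauses)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python gc_scan.py /path/to/system.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        count, worst = summarize(gc_pauses(log))
    print(f"{count} GC pauses >= 200ms, longest {worst}ms")
```

A steadily growing count of long pauses as uptime increases would be consistent with the heap-leak diagnosis; few or no matches would point elsewhere.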


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch:
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(6150926370328526396, 696a6374652e6f7267) (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>         at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_71]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_71]
>         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]
>
>
> Do you think running upgradesstables would help? Or relocatesstables? I
> presumed it shouldn't be necessary for Cassandra to function, just an
> optimization.
>
> On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Dan,
>>
>>
>>
>> do you see any major GC? We have been hit by the following memory leak in
>> our loadtest environment with 3.11.0.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-13754
>>
>>
>>
>> So, depending on the heap size and uptime, you might get into heap
>> troubles.
>>
>>
>>
>> Thomas
>>
>>
>>
>> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
>> *Sent:* Thursday, 28 September 2017 18:20
>> *To:* user@cassandra.apache.org
>> *Subject:*
>>
>>
>>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>>
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
>>         at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
>>         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>>
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepairStage would err.
>>
>> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
>> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
>> If I can't find a resolution I'm going to need to downgrade and restore to
>> backup...
>>
>> The only issue I found that looked similar is
>> https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
>> be fixed by 3.10.
>>
>>
>>
>> $ nodetool tpstats
>>
>> Pool Name                         Active   Pending      Completed   Blocked  All time blocked
>> ReadStage                              0         0         582103         0                 0
>> MiscStage                              0         0              0         0                 0
>> CompactionExecutor                    11        11           2868         0                 0
>> MutationStage                         32   4593678       55057393         0                 0
>> GossipStage                            1      2818         371487         0                 0
>> RequestResponseStage                   0         0        4345522         0                 0
>> ReadRepairStage                        0         0         151473         0                 0
>> CounterMutationStage                   0         0              0         0                 0
>> MemtableFlushWriter                    1        81             76         0                 0
>> MemtablePostFlush                      1       382            139         0                 0
>> ValidationExecutor                     0         0              0         0                 0
>> ViewMutationStage                      0         0              0         0                 0
>> CacheCleanupExecutor                   0         0              0         0                 0
>> PerDiskMemtableFlushWriter_10          0         0             69         0                 0
>> PerDiskMemtableFlushWriter_11          0         0             69         0                 0
>> MemtableReclaimMemory                  0         0             81         0                 0
>> PendingRangeCalculator                 0         0             32         0                 0
>> SecondaryIndexManagement               0         0              0         0                 0
>> HintsDispatcher                        0         0            596         0                 0
>> PerDiskMemtableFlushWriter_1           0         0             69         0                 0
>> Native-Transport-Requests             11         0        4547746         0                67
>> PerDiskMemtableFlushWriter_2           0         0             69         0                 0
>> MigrationStage                         1      1545            586         0                 0
>> PerDiskMemtableFlushWriter_0           0         0             80         0                 0
>> Sampler                                0         0              0         0                 0
>> PerDiskMemtableFlushWriter_5           0         0             69         0                 0
>> InternalResponseStage                  0         0          45432         0                 0
>> PerDiskMemtableFlushWriter_6           0         0             69         0                 0
>> PerDiskMemtableFlushWriter_3           0         0             69         0                 0
>> PerDiskMemtableFlushWriter_4           0         0             69         0                 0
>> PerDiskMemtableFlushWriter_9           0         0             69         0                 0
>> AntiEntropyStage                       0         0              0         0                 0
>> PerDiskMemtableFlushWriter_7           0         0             69         0                 0
>> PerDiskMemtableFlushWriter_8           0         0             69         0                 0
>>
>>
>>
>> Message type           Dropped
>> READ                         0
>> RANGE_SLICE                  0
>> _TRACE                       0
>> HINT                         0
>> MUTATION                     0
>> COUNTER_MUTATION             0
>> BATCH_STORE                  0
>> BATCH_REMOVE                 0
>> REQUEST_RESPONSE             0
>> PAGED_RANGE                  0
>> READ_REPAIR                  0
>>
>>
>>
>> -dan
>>
>
>
>
> --
> Dan Kinder
> Principal Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>
