The digest mismatch exception is not a problem, that's why it's only logged at debug.
As Thomas noted, there's a pretty good chance this is https://issues.apache.org/jira/browse/CASSANDRA-13754 - if you see a lot of GCInspector logs indicating GC pauses, that would add confidence to that diagnosis. <https://issues.apache.org/jira/browse/CASSANDRA-13754> On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder <dkin...@turnitin.com> wrote: > Thanks for the responses. > > @Prem yes this is after the entire cluster is on 3.11, but no I did not > run upgradesstables yet. > > @Thomas no I don't see any major GC going on. > > @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and > bring it back (thankfully this cluster is not serving live traffic). The > nodes seemed okay for an hour or two, but I see the issue again, without me > bouncing any nodes. This time it's ReadStage that's building up, and the > exception I'm seeing in the logs is: > > DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 > - Digest mismatch: > > org.apache.cassandra.service.DigestMismatchException: Mismatch for key > DecoratedKey(6150926370328526396, 696a6374652e6f7267) ( > 2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025) > > at > org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at org.apache.cassandra.service.ReadCallback$ > AsyncRepairRunner.run(ReadCallback.java:233) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_71] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_71] > > at org.apache.cassandra.concurrent.NamedThreadFactory. > lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > [apache-cassandra-3.11.0.jar:3.11.0] > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71] > > > Do you think running upgradesstables would help? Or relocatesstables? I > presumed it shouldn't be necessary for Cassandra to function, just an > optimization. > > On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas < > thomas.steinmau...@dynatrace.com> wrote: > >> Dan, >> >> >> >> do you see any major GC? We have been hit by the following memory leak in >> our loadtest environment with 3.11.0. >> >> https://issues.apache.org/jira/browse/CASSANDRA-13754 >> >> >> >> So, depending on the heap size and uptime, you might get into heap >> troubles. >> >> >> >> Thomas >> >> >> >> *From:* Dan Kinder [mailto:dkin...@turnitin.com] >> *Sent:* Donnerstag, 28. September 2017 18:20 >> *To:* user@cassandra.apache.org >> *Subject:* >> >> >> >> Hi, >> >> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the >> following. The cluster does function, for a while, but then some stages >> begin to back up and the node does not recover and does not drain the >> tasks, even under no load. This happens both to MutationStage and >> GossipStage. >> >> I do see the following exception happen in the logs: >> >> >> >> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 >> CassandraDaemon.java:228 - Exception in thread >> Thread[ReadRepairStage:2328,5,main] >> >> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed >> out - received only 1 responses. >> >> at >> org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at org.apache.cassandra.db.partitions.UnfilteredPartitionIterat >> ors$2.close(UnfilteredPartitionIterators.java:182) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at >> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at >> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThr >> ow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at >> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> ~[na:1.8.0_91] >> >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> ~[na:1.8.0_91] >> >> at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$ >> threadLocalDeallocator$0(NamedThreadFactory.java:81) >> ~[apache-cassandra-3.11.0.jar:3.11.0] >> >> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91] >> >> >> >> But it's hard to correlate precisely with things going bad. It is also >> very strange to me since I have both read_repair_chance and >> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is >> confusing why ReadRepairStage would err. >> >> Anyone have thoughts on this? It's pretty muddling, and causes nodes to >> lock up. Once it happens Cassandra can't even shut down, I have to kill -9. >> If I can't find a resolution I'm going to need to downgrade and restore to >> backup... >> >> The only issue I found that looked similar is https://issues.apache.org/j >> ira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10. >> >> >> >> $ nodetool tpstats >> >> Pool Name Active Pending Completed >> Blocked All time blocked >> >> ReadStage 0 0 582103 >> 0 0 >> >> MiscStage 0 0 0 >> 0 0 >> >> CompactionExecutor 11 11 2868 >> 0 0 >> >> MutationStage 32 4593678 55057393 >> 0 0 >> >> GossipStage 1 2818 371487 >> 0 0 >> >> RequestResponseStage 0 0 4345522 >> 0 0 >> >> ReadRepairStage 0 0 151473 >> 0 0 >> >> CounterMutationStage 0 0 0 >> 0 0 >> >> MemtableFlushWriter 1 81 76 >> 0 0 >> >> MemtablePostFlush 1 382 139 >> 0 0 >> >> ValidationExecutor 0 0 0 >> 0 0 >> >> ViewMutationStage 0 0 0 >> 0 0 >> >> CacheCleanupExecutor 0 0 0 >> 0 0 >> >> PerDiskMemtableFlushWriter_10 0 0 69 >> 0 0 >> >> PerDiskMemtableFlushWriter_11 0 0 69 >> 0 0 >> >> MemtableReclaimMemory 0 0 81 >> 0 0 >> >> PendingRangeCalculator 0 0 32 >> 0 0 >> >> SecondaryIndexManagement 0 0 0 >> 0 0 >> >> HintsDispatcher 0 0 596 >> 0 0 >> >> PerDiskMemtableFlushWriter_1 0 0 69 >> 0 0 >> >> Native-Transport-Requests 11 0 4547746 >> 0 67 >> >> PerDiskMemtableFlushWriter_2 0 0 69 >> 0 0 >> >> MigrationStage 1 1545 586 >> 0 0 >> >> PerDiskMemtableFlushWriter_0 0 0 80 >> 0 0 >> >> Sampler 0 0 0 >> 0 0 >> >> PerDiskMemtableFlushWriter_5 0 0 69 >> 0 0 >> >> InternalResponseStage 0 0 45432 >> 0 0 >> >> PerDiskMemtableFlushWriter_6 0 0 69 >> 0 0 >> >> PerDiskMemtableFlushWriter_3 0 0 69 >> 0 0 >> >> PerDiskMemtableFlushWriter_4 0 0 69 >> 0 0 >> >> PerDiskMemtableFlushWriter_9 0 0 69 >> 0 0 >> >> AntiEntropyStage 0 0 0 >> 0 0 >> >> PerDiskMemtableFlushWriter_7 0 0 69 >> 0 0 >> >> PerDiskMemtableFlushWriter_8 0 0 69 >> 0 0 >> >> >> >> Message type Dropped >> >> READ 0 >> >> RANGE_SLICE 0 >> >> _TRACE 0 >> >> HINT 0 >> >> MUTATION 0 >> >> COUNTER_MUTATION 0 >> >> BATCH_STORE 0 >> >> BATCH_REMOVE 0 >> >> REQUEST_RESPONSE 0 >> >> PAGED_RANGE 0 >> >> READ_REPAIR 0 >> >> >> >> -dan >> The contents of this e-mail are intended for the named addressee only. It >> contains information that may be confidential. Unless you are the named >> addressee or an authorized designee, you may not copy or use it, or >> disclose it to anyone else. If you received it in error please notify us >> immediately and then destroy it. Dynatrace Austria GmbH (registration >> number FN 91482h) is a company registered in Linz whose registered office >> is at 4040 Linz, Austria, Freistädterstraße 313 >> <https://maps.google.com/?q=4040+Linz,+Austria,+Freist%C3%A4dterstra%C3%9Fe+313&entry=gmail&source=g> >> > > > > -- > Dan Kinder > Principal Software Engineer > Turnitin – www.turnitin.com > dkin...@turnitin.com >