Li,

> I’ve confirmed that the inconsistency issues disappeared after repair
> finished.
>
> Anything changed with repair in 3.11.1? One difference I noticed is that
> the validation step during repair could turn down the node upon large
> tables, which never happen in 3.10. I had to throttle validation requests
> to let it pass. Also I switched back to -pr instead of incremental repair
> which is a resource killer and often hangs for the first node to be
> repaired.

When you switched back to non-incremental repair, did you set `repairedAt` on all sstables (on all nodes) back to zero (that is, the unrepaired state)? This should have been done with `sstablerepairedset --is-unrepaired … ` while the node is stopped.

> To address the inconsistency issue, I could do Write All and Read One by
> giving up availability and stop running repair. Any comments on that?

You lose availability doing this, and at the number of reads you're doing I would not recommend it. You could think about using a fallback strategy that initially tries CL.ALL and falls back to CL.QUORUM. But this is a hack, it could overload your cluster, and if the inconsistency is correlated with dropped messages or flapping nodes it won't help.

I'd also be prepared to upgrade to 3.11.3 when it does get released.

regards,
Mick

--
Mick Semb Wever
Australia

The Last Pickle
Apache Cassandra Consulting
http://www.thelastpickle.com
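
For illustration only, the "try CL.ALL, fall back to CL.QUORUM" idea mentioned above could be sketched roughly as below. This is a driver-agnostic sketch, not something taken from a real deployment: `read_with_fallback`, `run_query` and `UnavailableError` are hypothetical names, and with a real driver you would catch its own unavailable/timeout exceptions and pass its own consistency-level constants instead of the strings used here.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class UnavailableError(Exception):
    """Hypothetical stand-in for a driver's 'not enough replicas' or timeout error."""


def read_with_fallback(
    run_query: Callable[[str], T],
    levels: Sequence[str] = ("ALL", "QUORUM"),
) -> T:
    """Try the query at each consistency level in order, returning the first success.

    run_query is a caller-supplied function (an assumption of this sketch) that
    executes the read at the given consistency level and raises
    UnavailableError when that level cannot be satisfied.
    """
    if not levels:
        raise ValueError("at least one consistency level is required")

    last_error: Optional[UnavailableError] = None
    for level in levels:
        try:
            return run_query(level)
        except UnavailableError as err:
            # This level could not be met; remember why and try the next, weaker one.
            last_error = err

    # Every level failed: surface the most recent failure to the caller.
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    def fake_query(level: str) -> str:
        # Simulates one replica being down: ALL fails, QUORUM succeeds.
        if level == "ALL":
            raise UnavailableError("one replica is down")
        return "row read at " + level

    print(read_with_fallback(fake_query))
```

Note that every failed CL.ALL attempt is followed by a second read, which is where the extra load on the cluster comes from.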