Your schema may have background (non-blocking) read repair set to 10% (0.1, the `dclocal_read_repair_chance` table option). You may have GC pauses delaying writes (or reads). You may be hitting a Cassandra bug.
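For what it's worth, the "2 out of 3" arithmetic from the quoted question can be sketched in a few lines of Python. This is only an illustration of the standard quorum formula (floor(RF / 2) + 1), not of anything specific to your cluster; the function names `quorum`, `guaranteed_overlap` and `possibly_stale_in_read` are made up for this example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM / LOCAL_QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def guaranteed_overlap(rf: int, write_acks: int, read_replicas: int) -> int:
    """Minimum number of replicas that both acknowledged the write and are read from."""
    return max(0, write_acks + read_replicas - rf)


def possibly_stale_in_read(rf: int, write_acks: int, read_replicas: int) -> int:
    """Worst case: how many replicas contacted by the read may not have the write yet."""
    return min(read_replicas, rf - write_acks)


rf = 3                      # replication factor from the quoted question
q = quorum(rf)              # LOCAL_QUORUM size in the local DC
overlap = guaranteed_overlap(rf, q, q)
stale = possibly_stale_in_read(rf, q, q)

print(f"RF={rf}: quorum={q}, guaranteed overlap={overlap}, possibly stale in read={stale}")
```

With RF 3 the read always includes at least one replica that acknowledged the write, so it can return the right data, but it may also include the one replica that had not applied the write yet. That mismatch between replicas is one plausible way for a read repair to show up even though nothing was lost.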
Would need the `TRACING` output to know for sure.

On Mon, Aug 10, 2020 at 10:10 PM Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:

> Hi
>
> We have a Cassandra solution with 2 DCs where each DC has >30 nodes
>
> From time to time we see problems with READ REPAIR, but I am stuck with
> the analysis
>
> We have a pattern for these faults where we do
>
> 1. INSERT with Local Quorum (2 out of 3)
> 2. Wait for 0.5 - 1 seconds time window
> 3. READ with Local Quorum (2 out of 3)
>    1. Triggers a read repair
> 4. Then we do an UPDATE …
>
> The replication factor is 3
>
> In my world in (1) we for sure store the data in 2 out of 3 places, and I
> would be surprised if we would not also reach the 3rd node within 0.5 sec
>
> So how come in (3) the read can’t get a proper response from 2 out of 3
>
> Some are saying the problem started occurring when we added DC2, but I
> can’t understand how it could be as our query is Local Quorum and will
> involve only DC1
>
> How can I debug this fault ?
>
> How can I track if the data has reached all 3 nodes ?
>
> All ideas are welcome
>
> -Tobias