Some more info.
From Java, using the DataStax 4.9.0 driver, I'm selecting an entire
table. After about 17 million rows (the table probably holds around 150
million rows), I get:
com.datastax.oss.driver.api.core.servererrors.ReadFailureException:
Cassandra failure during read query at consistency ONE (1 responses were
required but only 0 replica responded, 1 failed)
It's almost as if the data was not written with LOCAL_QUORUM, but I've
triple checked.
If I stop writes to the table and reduce the load on Cassandra, then the
Java program works OK. Presto queries still fail, but that might be a
Presto issue. Interestingly, they sometimes fail almost immediately with
the 'Cassandra failure during read query' error, and sometimes get
through 140 million rows before dying.
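One thing that may be worth trying for the full-table read is splitting it by token range, so that each query covers a small slice and a failed slice can be retried on its own. Below is a minimal sketch of the range arithmetic only. It assumes the default Murmur3Partitioner (tokens from Long.MIN_VALUE to Long.MAX_VALUE); the class and method names are hypothetical, and nothing in this thread confirms that it resolves the ReadFailureException.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class TokenRanges {
    /**
     * Splits (Long.MIN_VALUE, Long.MAX_VALUE] into n contiguous
     * (start, end] ranges of roughly equal width.
     * Assumes Murmur3Partitioner; other partitioners use other token types.
     */
    static List<long[]> split(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        // Width of the whole token space; BigInteger avoids long overflow.
        BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        List<long[]> ranges = new ArrayList<>();
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= n; i++) {
            long end = (i == n)
                ? Long.MAX_VALUE
                : min.add(span.multiply(BigInteger.valueOf(i))
                              .divide(BigInteger.valueOf(n)))
                     .longValueExact();
            ranges.add(new long[] {start, end});
            start = end; // next range starts where this one ends (exclusive)
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(8)) {
            System.out.println("token > " + r[0] + " AND token <= " + r[1]);
        }
    }
}
```

Each pair would then be bound into a query of the form SELECT ... WHERE token(pk) > ? AND token(pk) <= ?, where pk stands in for the table's actual partition key column(s).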
Do regular table repairs need to be run when using LOCAL_QUORUM? I
see no nodes down and no disk failures.
-Joe
On 12/14/2020 9:41 AM, Joe Obernberger wrote:
Thanks all for the help on this. I've changed all my writes to
LOCAL_QUORUM, and my reads as well. Under a constant load of writes to
and reads from the same table, I'm still getting:
DEBUG [ReadRepairStage:372] 2020-12-14 09:36:09,002 ReadCallback.java:244 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-7287062361589376757, 44535f313034335f333332353839305f323032302d31322d31325430302d31392d33312e3330335a) (054250ecd7170b1707ec36c6f1798ed0 vs 5752eec36bff050dd363b7803c500a95)
    at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.9.jar:3.11.9]
    at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:235) ~[apache-cassandra-3.11.9.jar:3.11.9]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_272]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_272]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.9.jar:3.11.9]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]
Under load this happens a lot; several times a second on each of the
server nodes. I started with a new table and under light load, it
worked wonderfully - no issues. But under heavy load, it still
occurs. Is there a different setting?
Also, when this happens, I cannot query the table from Presto, as I
then get the familiar:
"Query 20201214_143949_00000_b3fnt failed: Cassandra timeout during
read query at consistency LOCAL_QUORUM (2 responses were required but
only 1 replica responded)"
Changing Presto to use ONE results in an error saying that 1 response
was required, but only 1 responded.
Any ideas? Things to try? Thanks!
-Joe
On 12/3/2020 12:49 AM, Erick Ramirez wrote:
Thank you Steve - once I have the key, how do I get to a node?
Run this command to determine which replicas own the partition:
$ nodetool getendpoints <keyspace> <table> <partition_key>
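The key that nodetool getendpoints expects is the partition key value itself, whereas the DecoratedKey in a digest-mismatch log line shows it as hex-encoded bytes. Below is a minimal sketch for turning that hex back into text. It assumes the partition key is a single text column (a composite key carries extra length bytes and would not decode cleanly this way); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class PartitionKeyHex {
    /** Decodes a hex string (two characters per byte) into UTF-8 text. */
    static String decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Parse each pair of hex digits as one byte.
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hex taken from the DecoratedKey in the DigestMismatchException log line.
        System.out.println(decode(
            "44535f313034335f333332353839305f323032302d31322d3132"
            + "5430302d31392d33312e3330335a"));
    }
}
```

For the key in the log line quoted in this thread, this prints DS_1043_3325890_2020-12-12T00-19-31.303Z, which is the value to pass as the partition key.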
So if the propagation has not taken place and a node doesn't have
the data and is the first to 'be asked' the client will get no data?
That's correct. It will not return data it doesn't have when querying
with a consistency of ONE. There are limited cases where ONE is
applicable. In most cases, a strong consistency of LOCAL_QUORUM is
recommended to avoid the scenario you described. Cheers!
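The reasoning behind that recommendation can be sketched as simple arithmetic: a read is guaranteed to touch at least one replica that holds the latest write only when the write and read replica counts together exceed the replication factor. The sketch below assumes a replication factor of 3 in the local datacenter, which is consistent with the "2 responses were required" message at LOCAL_QUORUM but is not stated in the thread. The class and method names are hypothetical.

```java
public class QuorumMath {
    /** Number of replicas a QUORUM / LOCAL_QUORUM request waits for. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must intersect every write set. */
    static boolean readSeesLatestWrite(int writeReplicas, int readReplicas,
                                       int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3; // assumed replication factor
        int q = quorum(rf);
        System.out.println("quorum for RF " + rf + " = " + q);
        System.out.println("write LOCAL_QUORUM, read LOCAL_QUORUM: "
            + readSeesLatestWrite(q, q, rf));
        System.out.println("write LOCAL_QUORUM, read ONE: "
            + readSeesLatestWrite(q, 1, rf));
    }
}
```

Under that assumption, writing and reading at LOCAL_QUORUM overlap (2 + 2 > 3), while writing at LOCAL_QUORUM and reading at ONE does not (2 + 1 = 3), so a read at ONE can still land on the one replica that has not yet received the write.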