Re: Cassandra/Spark failing to process large table

2018-03-08 Thread kurt greaves
Note that read repairs only occur for QUORUM/equivalent and higher, and also with a 10% (default) chance on anything less than QUORUM (ONE/LOCAL_ONE). This is configured at the table level through the dclocal_read_repair_chance and read_repair_chance settings (which are going away in 4.0). So if yo
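The table-level settings Kurt mentions can be adjusted with a CQL `ALTER TABLE`; a minimal sketch, assuming a hypothetical keyspace/table (and noting these options are removed in Cassandra 4.0):

```sql
-- Hypothetical table; these options exist in Cassandra 3.x and are gone in 4.0.
-- dclocal_read_repair_chance: probability that a read below QUORUM triggers a
-- background repair across replicas in the local datacenter (default 0.1).
-- read_repair_chance: same, but across all datacenters (default 0.0).
ALTER TABLE my_keyspace.my_table
  WITH dclocal_read_repair_chance = 0.1
   AND read_repair_chance = 0.0;
```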

Re: Cassandra/Spark failing to process large table

2018-03-08 Thread Faraz Mateen
Hi Ben, That makes sense. I also read about "read repairs". So, once an inconsistent record is read, Cassandra synchronizes its replicas on the other nodes as well. I ran the same Spark query again, this time with the default consistency level (LOCAL_ONE), and the result was correct. Thanks again for the

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Ben Slater
Hi Faraz Yes, it likely does mean there is inconsistency in the replicas. However, you shouldn’t be too freaked out about it - Cassandra is designed to allow for this inconsistency to occur, and the consistency levels let you achieve consistent results despite the replicas themselves not being consistent. To k

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Faraz Mateen
Thanks a lot for the response. Setting consistency to ALL/TWO started giving me consistent count results on both cqlsh and Spark. As expected, my query time has increased by 1.5x (before, it was taking ~1.6 hours, but with consistency level ALL the same query takes ~2.4 hours to complete). Does
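For the Spark side, the read consistency level is set through the connector configuration rather than in cqlsh; a sketch assuming the DataStax spark-cassandra-connector (the host address is hypothetical):

```
# spark.cassandra.input.consistency.level controls the consistency level
# the connector uses for reads (LOCAL_ONE by default).
spark-shell --conf spark.cassandra.connection.host=10.0.0.1 \
            --conf spark.cassandra.input.consistency.level=ALL
```

Raising this to ALL trades read latency for strong consistency, which matches the ~1.5x slowdown observed above.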

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Ben Slater
Both cqlsh and the Spark Cassandra connector query at consistency level ONE (LOCAL_ONE for the Spark connector) by default, so any inconsistency in your replicas can result in inconsistent query results. See http://cassandra.apache.org/doc/latest/tools/cqlsh.html and https://github.com/datasta
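The cqlsh default can be checked and overridden per session with the `CONSISTENCY` command; a minimal sketch, with a hypothetical keyspace/table:

```sql
-- In a cqlsh session:
CONSISTENCY;         -- prints the current level (ONE by default)
CONSISTENCY QUORUM;  -- subsequent reads/writes in this session use QUORUM
SELECT count(*) FROM my_keyspace.my_table;
```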

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Kant Kodali
The fact that cqlsh itself gives different results tells me that this has nothing to do with Spark. Moreover, the Spark results are monotonically increasing, which seems more consistent than cqlsh, so I believe Spark can be taken out of the equation. Now, while you are running these queries, is th