> We notice incredibly slow reads, 600mb in an hour, we are using quorum
LOCAL_ONE reads.
> The load_one of Cassandra increases from <1 to 60! There is no CPU wait,
only user & nice.

Without seeing the code and query, it's hard to tell, but I noticed
something similar when we had a client incorrectly using the 'take' method
for a result count like so:
val resultCount = query.take(count).length

'take' can call limit under the hood. The docs for the latter are
interesting:
"The limit will be applied for each created Spark partition. In other
words, unless the data are fetched from a single Cassandra partition the
number of results is unpredictable." [0]
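A rough illustration of what that means in practice (the keyspace/table names here are made up, and sc is assumed to be an existing SparkContext configured for the connector):

import com.datastax.spark.connector._

// The 10-row limit is pushed down per Spark partition: every Spark partition
// still issues its own query over its token ranges, so the total number of
// rows returned (and the work done) depends on how the scan was split up.
val rows = sc.cassandraTable("my_ks", "events").limit(10).collect()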

Removing that line (it wasn't necessary for the use case) and just relying
on a simple 'myRDD.select("my_col").toArray.foreach' got performance back
to where it should be. Per the docs, limit (and therefore take) works fine
as long as the partition key is used as a predicate in the WHERE clause
("WHERE test_id = somevalue" in your example).

[0]
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L92-L101

--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
