Thanks for the suggestion, will take a look. Our code looks like this:
    val rdd = sc.cassandraTable[EventV0](keyspace, "test")
    val transformed = rdd.map { e => EventV1(e.testId, e.ts, e.channel, e.groups, e.event) }
    transformed.saveToCassandra(keyspace, "test_v1")

Not sure whether this code translates into limits under the hood. The total data in this table is roughly 2 GB on disk; the total data on each node is around 290 GB.

On Fri, Jun 26, 2015 at 7:01 PM Nate McCall <n...@thelastpickle.com> wrote:

> > We notice incredibly slow reads, 600 MB in an hour, and we are using
> > quorum LOCAL_ONE reads.
> >
> > The load_one of Cassandra increases from <1 to 60! There is no CPU wait,
> > only user & nice.
>
> Without seeing the code and query, it's hard to tell, but I noticed
> something similar when we had a client incorrectly using the 'take' method
> for a result count, like so:
>
>     val resultCount = query.take(count).length
>
> 'take' can call limit under the hood. The docs for the latter are
> interesting:
>
> "The limit will be applied for each created Spark partition. In other
> words, unless the data are fetched from a single Cassandra partition the
> number of results is unpredictable." [0]
>
> Removing that line (it wasn't necessary for the use case) and just relying
> on a simple 'myRDD.select("my_col").toArray.foreach' got performance back
> to where it should be. Per the docs, limit (and therefore take) works fine
> as long as the partition key is used as a predicate in the WHERE clause
> ("WHERE test_id = somevalue" in your example).
>
> [0] https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L92-L101
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
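
To make Nate's point concrete, the two read patterns side by side would look roughly like the sketch below. This is only a minimal sketch of the take/limit behaviour described above, assuming the spark-cassandra-connector where/take API from the thread; the keyspace name, connection host, and "somevalue" predicate are hypothetical placeholders:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object TakeLimitSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("take-limit-sketch")
          .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
        val sc = new SparkContext(conf)
        val keyspace = "my_keyspace" // hypothetical keyspace

        // Without a partition-key predicate, 'take' calls 'limit' under the
        // hood, and per the docs quoted above the limit is applied for each
        // Spark partition: every token range may still be queried, which is
        // where the slowdown comes from.
        val slowCount = sc.cassandraTable(keyspace, "test").take(10).length

        // With the partition key in the WHERE clause, the read is pinned to
        // a single Cassandra partition and the limit behaves as expected.
        val fastCount = sc.cassandraTable(keyspace, "test")
          .where("test_id = ?", "somevalue")
          .take(10)
          .length

        println(s"slow=$slowCount fast=$fastCount")
        sc.stop()
      }
    }

The only difference between the two is the where clause: it restricts the scan to one Cassandra partition, so the per-partition limit semantics quoted from the docs no longer apply.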