Thanks for the suggestion, will take a look.

Our code looks like this:

// Read the old table, map each row to the v1 schema, and write it out.
val rdd = sc.cassandraTable[EventV0](keyspace, "test")

val transformed = rdd.map { e =>
  EventV1(e.testId, e.ts, e.channel, e.groups, e.event)
}
transformed.saveToCassandra(keyspace, "test_v1")

I'm not sure whether this code would translate into a limit under the hood.

The total data in this table is roughly 2 GB on disk; total data per node is
around 290 GB. At the ~600 MB/hour we're seeing, copying those 2 GB takes
over three hours.

On Fri, Jun 26, 2015 at 7:01 PM Nate McCall <n...@thelastpickle.com> wrote:

> > We notice incredibly slow reads, 600 MB in an hour; we are using
> > LOCAL_ONE reads.
> > The load_one on the Cassandra nodes increases from <1 to 60! There is no
> > CPU wait, only user & nice.
>
> Without seeing the code and query, it's hard to tell, but I noticed
> something similar when we had a client incorrectly using the 'take' method
> for a result count like so:
> val resultCount = query.take(count).length
>
> 'take' can call limit under the hood. The docs for the latter are
> interesting:
> "The limit will be applied for each created Spark partition. In other
> words, unless the data are fetched from a single Cassandra partition the
> number of results is unpredictable." [0]
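>
> For example (a sketch, not our client's code; the table name is
> illustrative), this pushes LIMIT 10 into the CQL query issued for every
> Spark partition, so the RDD can return up to 10 rows per Spark partition
> rather than 10 rows total:
>
>   val rows = sc.cassandraTable(keyspace, "test").limit(10).collect()
>   // rows.length can be as large as 10 * <number of Spark partitions>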
>
> Removing that line (it wasn't necessary for the use case) and just relying
> on a simple myRDD.select("my_col").toArray.foreach got performance back to
> where it should be. Per the docs, limit (and therefore take) works fine as
> long as the partition key is used as a predicate in the WHERE clause
> ("WHERE test_id = somevalue" in your example).
>
> [0]
> https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L92-L101
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
