Thanks Helena, very useful comment. But is ‘spark.cassandra.input.split.size’ only effective in a cluster, not on a single node?
best,
/Shahab

On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson <[email protected]> wrote:

> Shahab,
>
> Regardless, WRT Cassandra and Spark when using the spark-cassandra
> connector, ‘spark.cassandra.input.split.size’ passed into the SparkConf
> configures the approx number of Cassandra partitions in a Spark partition
> (default 100000).
> No repartitioning should be necessary with what you have below, but I
> don’t know if you are running on one node or a cluster.
>
> This is a good initial guide:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37
>
> Cheers,
> Helena
> @helenaedelson
>
> On Oct 30, 2014, at 1:12 PM, Helena Edelson <[email protected]> wrote:
>
> Hi Shahab,
> - How many Spark/Cassandra nodes are in your cluster?
> - What is your deploy topology for the Spark and Cassandra clusters? Are
> they co-located?
>
> - Helena
> @helenaedelson
>
> On Oct 30, 2014, at 12:16 PM, shahab <[email protected]> wrote:
>
> Hi,
>
> I am running an application in Spark which first loads data from
> Cassandra and then performs some map/reduce jobs.
>
> val srdd = sqlContext.sql("select * from mydb.mytable")
>
> I noticed that "srdd" has only one partition, no matter how big the data
> loaded from Cassandra is.
>
> So I perform "repartition" on the RDD, and then run the map/reduce
> functions.
>
> But the main problem is that "repartition" takes so much time (almost 2
> min), which is not acceptable in my use case. Is there a better way to do
> the repartitioning?
>
> best,
> /Shahab
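[For concreteness, a minimal sketch of passing the split size into the SparkConf and reading through the connector's cassandraTable API. The host address and the value 10000 are illustrative assumptions; mydb/mytable come from the thread above.]

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Lower the split size so the same data spreads across more Spark
    // partitions (default: 100000 Cassandra partitions per Spark partition,
    // as noted above). The value 10000 is illustrative, not a recommendation.
    val conf = new SparkConf()
      .setAppName("cassandra-read")
      .set("spark.cassandra.connection.host", "127.0.0.1") // illustrative host
      .set("spark.cassandra.input.split.size", "10000")
    val sc = new SparkContext(conf)

    // Reading through the connector directly; the resulting RDD's partition
    // count follows the split size, so no repartition() should be needed.
    val rdd = sc.cassandraTable("mydb", "mytable")
    println(rdd.partitions.length)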

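[For comparison, a sketch of the read-then-repartition pattern described in the thread, assuming the sqlContext shown there is the connector's CassandraSQLContext, which lets "mydb.mytable" be queried directly; the partition count 8 is illustrative.]

    import org.apache.spark.sql.cassandra.CassandraSQLContext

    // Assumption: the thread's sqlContext is a CassandraSQLContext.
    val sqlContext = new CassandraSQLContext(sc)
    val srdd = sqlContext.sql("select * from mydb.mytable")
    println(srdd.partitions.length) // reportedly 1, regardless of data size

    // repartition() forces a full shuffle of the loaded data, which is
    // where the ~2 minute cost reported in the thread comes from.
    val repartitioned = srdd.repartition(8) // 8 is an illustrative count

[If the split size takes effect on the read itself, as Helena suggests, the shuffle step above becomes unnecessary.]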