Thanks Helena, very useful comment. But is ‘spark.cassandra.input.split.size’ only effective in a cluster, not on a single node?
best,
/Shahab

On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson <[email protected]> wrote:

> Shahab,
>
> Regardless, WRT Cassandra and Spark when using the spark-cassandra
> connector, ‘spark.cassandra.input.split.size’ passed into the SparkConf
> configures the approx number of Cassandra partitions in a Spark partition
> (default 100000).
> No repartitioning should be necessary with what you have below, but I
> don’t know if you are running on one node or a cluster.
>
> This is a good initial guide:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37
>
> Cheers,
> Helena
> @helenaedelson
>
> On Oct 30, 2014, at 1:12 PM, Helena Edelson <[email protected]> wrote:
>
> Hi Shahab,
> - How many Spark/Cassandra nodes are in your cluster?
> - What is your deploy topology for the Spark and Cassandra clusters? Are
> they co-located?
>
> - Helena
> @helenaedelson
>
> On Oct 30, 2014, at 12:16 PM, shahab <[email protected]> wrote:
>
> Hi,
>
> I am running an application in Spark which first loads data from
> Cassandra and then performs some map/reduce jobs.
>
> val srdd = sqlContext.sql("select * from mydb.mytable")
>
> I noticed that "srdd" has only one partition, no matter how big the data
> loaded from Cassandra is.
>
> So I perform "repartition" on the RDD, and then run the map/reduce
> functions.
>
> But the main problem is that "repartition" takes so much time (almost 2
> min), which is not acceptable in my use case. Is there a better way to do
> the repartitioning?
>
> best,
> /Shahab
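[For concreteness, a minimal sketch of passing the split size into the SparkConf and reading through the connector's cassandraTable API. The host address and the value 10000 are illustrative assumptions; mydb/mytable come from the thread above.]

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Lower the split size so the same data spreads across more Spark
    // partitions (default: 100000 Cassandra partitions per Spark partition,
    // as noted above). The value 10000 is illustrative, not a recommendation.
    val conf = new SparkConf()
      .setAppName("cassandra-read")
      .set("spark.cassandra.connection.host", "127.0.0.1") // illustrative host
      .set("spark.cassandra.input.split.size", "10000")
    val sc = new SparkContext(conf)

    // Reading through the connector directly; the resulting RDD's partition
    // count follows the split size, so no repartition() should be needed.
    val rdd = sc.cassandraTable("mydb", "mytable")
    println(rdd.partitions.length)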

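[For comparison, a sketch of the read-then-repartition pattern described in the thread, assuming the sqlContext shown there is the connector's CassandraSQLContext, which lets "mydb.mytable" be queried directly; the partition count 8 is illustrative.]

    import org.apache.spark.sql.cassandra.CassandraSQLContext

    // Assumption: the thread's sqlContext is a CassandraSQLContext.
    val sqlContext = new CassandraSQLContext(sc)
    val srdd = sqlContext.sql("select * from mydb.mytable")
    println(srdd.partitions.length) // reportedly 1, regardless of data size

    // repartition() forces a full shuffle of the loaded data, which is
    // where the ~2 minute cost reported in the thread comes from.
    val repartitioned = srdd.repartition(8) // 8 is an illustrative count

[If the split size takes effect on the read itself, as Helena suggests, the shuffle step above becomes unnecessary.]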