There may be cases where you want to adjust the number of partitions or
explicitly call RDD.repartition or RDD.coalesce. However, I would start
with the defaults and then adjust if necessary to improve performance (for
example, if cores are idling because there aren't enough tasks, you may want
more partitions).
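For illustration (not part of the original message; the path and partition counts are arbitrary), a minimal sketch of adjusting partition counts explicitly in the Scala API:

val rdd = sc.textFile("hdfs:///some/path")   // partition count follows the HDFS input splits
val widened  = rdd.repartition(200)          // full shuffle; useful when cores are idling for lack of tasks
val narrowed = rdd.coalesce(10)              // avoids a full shuffle when only reducing the partition count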
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning.
First, I think you might have a misconception about partitioning. ALL RDDs
are partitioned (even if they have only a single partition). When reading from
HDFS, the number of partitions depends on how the data is stored in HDFS.
After data is shuffled (generally caused by things like reduceByKey), the
number of partitions is determined by spark.default.parallelism unless you
pass an explicit numPartitions or partitioner.
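A small sketch of how this shows up in practice (illustrative only; the path and the count of 100 are arbitrary, and the SparkContext._ import is only needed outside the spark-shell in Spark 1.x):

import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (Spark 1.x)

val raw = sc.textFile("hdfs:///some/path")
println(raw.partitions.size)             // follows the HDFS input splits

// shuffle into an explicit number of partitions
val counts = raw.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _, 100)
println(counts.partitions.size)          // 100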
Thanks, will give that a try.
I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40-node cluster, what's the recommended number of partitions?
You can try something like this:

val kvRdd = sc.textFile("rawdata/").map( m => {
  // split each line on the first tab into (key, value)
  val pfUser = m.split("\t", 2)
  (pfUser(0) -> pfUser(1))
})
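If you then want to control how that pair RDD is partitioned (the thread mentions HashPartitioner(8)), a sketch along these lines should work; the partition count of 8 is just the value from the question:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// hash-partition the key/value pairs by key into 8 partitions
val partitioned = kvRdd.partitionBy(new HashPartitioner(8))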
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node
yarn-cluster.