Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
There may be cases where you want to adjust the number of partitions or explicitly call RDD.repartition or RDD.coalesce. However, I would start with the defaults and then adjust if necessary to improve performance (for example, if cores are idling because there aren't enough tasks, you may want more partitions).
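The repartition/coalesce distinction can be modeled without a cluster. The sketch below is plain Scala, not Spark API: repartitionLike and coalesceLike are hypothetical stand-ins that mimic how repartition redistributes individual rows across partitions (a full shuffle), while coalesce only merges whole existing partitions.

```scala
// Plain-Scala model of partition layout; `repartitionLike` and
// `coalesceLike` are hypothetical helpers, not real Spark API.

// Round-robin rows across n partitions, like repartition's full shuffle.
def repartitionLike[A](data: Seq[A], n: Int): Seq[Seq[A]] =
  data.zipWithIndex.groupBy(_._2 % n).toSeq.sortBy(_._1).map(_._2.map(_._1))

// Merge whole partitions down to n, like coalesce (rows stay where they are).
def coalesceLike[A](parts: Seq[Seq[A]], n: Int): Seq[Seq[A]] =
  parts.zipWithIndex.groupBy(_._2 % n).toSeq.sortBy(_._1).map(_._2.flatMap(_._1))

val parts = repartitionLike(1 to 10, 4)  // 4 partitions
val fewer = coalesceLike(parts, 2)       // merged down to 2, rows not reshuffled
```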

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already partitioned, there is no need to worry about repartitioning.

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
First, I think you might have a misconception about partitioning. ALL RDDs are partitioned (even if they consist of a single partition). When reading from HDFS, the number of partitions depends on how the data is stored in HDFS. After data is shuffled (generally caused by operations like reduceByKey), the number of partitions is determined by the partitioner used, or by spark.default.parallelism if none is specified.
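To make the shuffle side of this concrete, here is a plain-Scala sketch of how Spark's HashPartitioner decides which partition a key lands in (it mirrors org.apache.spark.HashPartitioner; the key and partition count below are just example values):

```scala
// Java's % can return negative values, so map the result into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// null keys go to partition 0, matching Spark's HashPartitioner behavior.
def getPartition(key: Any, numPartitions: Int): Int =
  if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

val p = getPartition("user42", 8)  // always in [0, 8)
```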

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. I see the number of partitions requested is 8 (through HashPartitioner(8)). If I have a 40-node cluster, what's the recommended number of partitions?
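For sizing, the usual heuristic from the Spark tuning guide is 2-4 tasks per CPU core in the cluster. The arithmetic below is only illustrative; the cores-per-node figure is a made-up assumption, so substitute your own executor configuration:

```scala
val nodes = 40
val coresPerNode = 8                      // hypothetical; check your executors
val totalCores = nodes * coresPerNode     // 320 with these assumed numbers
val suggestedPartitions = totalCores * 2  // 2-4x total cores is the usual range
// => 640 here, far more parallelism than HashPartitioner(8) provides
```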

Re: Ways to partition the RDD

2014-08-14 Thread ssb61
You can try something like this:

    val kvRdd = sc.textFile("rawdata/").map { m =>
      // split each line into (key, value) on the first tab
      val pfUser = m.split("\t", 2)
      (pfUser(0), pfUser(1))
    }

Re: Ways to partition the RDD

2014-08-13 Thread bdev
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node yarn-cluster.