The type of shuffling is best explained by Matei in his Spark Internals talk: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203. Why don't you look at that and then ask any follow-up questions here? It would also be worth watching the whole talk, as it covers Spark job flows in much more detail.
import org.apache.spark.RangePartitioner

var file = sc.textFile("<my local path>")
var partitionedFile = file.map(x => (x, 1))
var data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))

data.glom().collect()(0).length
data.glom().collect()(1).length
data.glom().collect()(2).length

This will sample the RDD partitionedFile and then try to partition it into almost equal sizes. Do not call collect() if your data is huge, as that may OOM the driver; write the result to disk in that case.

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
> I am a newbie to Spark and I need to know how RDD partitioning can be
> controlled in the process of shuffling. I have googled for examples but
> haven't found many concrete ones, in contrast with the fact that there
> are many good tutorials about Hadoop's shuffling and partitioner.
>
> Can anybody show me good tutorials explaining the process of shuffling in
> Spark, as well as examples of how to use a customized partitioner?
>
>
> Best,
> Tao
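Since the question also asks about customized partitioners: a minimal sketch, assuming Spark's org.apache.spark.Partitioner API (the class name and routing rule here are hypothetical, chosen just for illustration). You subclass Partitioner and implement numPartitions and getPartition:

```scala
import org.apache.spark.Partitioner

// Hypothetical custom partitioner: keys whose first character is
// "a" through "m" go to partition 0, everything else to partition 1.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.nonEmpty && k.head.toLower <= 'm') 0 else 1
  }

  // Implementing equals/hashCode lets Spark recognize two RDDs as
  // co-partitioned and skip an unnecessary reshuffle on joins.
  override def equals(other: Any): Boolean =
    other.isInstanceOf[FirstLetterPartitioner]

  override def hashCode: Int = numPartitions
}

// Usage on a pair RDD, e.g. the partitionedFile from above:
// val byLetter = partitionedFile.partitionBy(new FirstLetterPartitioner)
```

Any key-value RDD can then be shuffled with it via partitionBy, the same way RangePartitioner is used above.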