The types of shuffling are best explained by Matei in his Spark Internals talk:
http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
Why don't you watch that first, and then ask any follow-up questions here?
It would also be worth watching the whole talk, as it covers Spark job
flows in much more detail.

Scala
import org.apache.spark.RangePartitioner

val file = sc.textFile("<my local path>")
// Turn each line into a (key, value) pair so it can be partitioned by key.
val partitionedFile = file.map(x => (x, 1))
// Repartition into 3 range-based partitions.
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
// Inspect the size of each partition.
data.glom().collect()(0).length
data.glom().collect()(1).length
data.glom().collect()(2).length
This will sample the RDD partitionedFile and then try to split
partitionedFile into partitions of roughly equal size.
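
If you just want to verify how balanced the partitions are, you can avoid calling glom().collect() on the full data and instead collect only one count per partition. A minimal sketch, using the data RDD from the snippet above:

```scala
// Collect a single Int per partition instead of the whole partition contents,
// so only tiny amounts of data reach the driver.
val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
sizes.foreach(println)
```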
Do not call collect() if your data is huge, as that may OOM the driver;
write the RDD out to disk in that case instead.
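
On the custom-partitioner part of the question: you can also subclass org.apache.spark.Partitioner yourself and pass an instance to partitionBy, the same way RangePartitioner is used above. A minimal sketch (the class name and the hash-modulo scheme here are just illustrative, not a standard Spark class):

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: route each key by a non-negative hash modulo.
class ModuloPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h // hashCode can be negative in Scala/Java
  }
}

// Usage, assuming a pair RDD like partitionedFile above:
// val custom = partitionedFile.partitionBy(new ModuloPartitioner(3))
```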




Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:

> I am a newbie to Spark and I need to know how RDD partitioning can be
> controlled in the process of shuffling. I have googled for examples but
> haven't found many concrete ones, in contrast with the many good tutorials
> about Hadoop's shuffling and partitioner.
>
> Can anybody show me good tutorials explaining the process of shuffling in
> Spark, as well as examples of how to use a customized partitioner?
>
>
> Best,
> Tao
>
