Take a look at the “advanced Spark features” talk here too: 
http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/.

Matei

On Feb 25, 2014, at 6:22 PM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:

> Thank you Mayur, I think that will help me a lot 
> 
> 
> Best,
> Tao
> 
> 
> 2014-02-26 8:56 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
> The type of shuffling is best explained by Matei in his Spark Internals talk: 
> http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
> Why don't you look at that and then ask follow-up questions here? It would also 
> be good to watch the whole talk, as it covers Spark job flows in a lot more 
> detail. 
> 
> Scala:
> 
> import org.apache.spark.RangePartitioner
> 
> // Build a key-value RDD, since partitioners work on pair RDDs.
> val file = sc.textFile("<my local path>")
> val partitionedFile = file.map(x => (x, 1))
> // Range-partition it into 3 partitions.
> val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
> // Inspect the size of each partition (glom turns each partition into an array).
> data.glom().collect()(0).length
> data.glom().collect()(1).length
> data.glom().collect()(2).length
> This samples the RDD partitionedFile and then tries to split it into 
> partitions of roughly equal size. 
> Do not call collect if your data is huge, as that may OOM the driver; write 
> the RDD to disk in that case. 
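> 
> For example, here is a rough sketch of that check which avoids pulling the 
> records back to the driver (the output path below is just a placeholder):
> 
> // Collect only the per-partition counts, not the records themselves, so the 
> // driver never holds the full RDD.
> val partitionSizes = data.glom().map(_.length).collect()
> partitionSizes.foreach(println)
> 
> // If you need the partitioned data itself, write it out instead of collecting:
> data.saveAsTextFile("<some output path>")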
> 
> 
> 
> 
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
> 
> 
> 
> On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
> I am a newbie to Spark and I need to know how RDD partitioning can be 
> controlled during shuffling. I have googled for examples but haven't found 
> many concrete ones, in contrast with the many good tutorials about Hadoop's 
> shuffling and partitioner.
> 
> Can anybody point me to good tutorials explaining the process of shuffling in 
> Spark, as well as examples of how to use a customized partitioner?
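> 
> For example, I guess a customized partitioner is a subclass roughly like the 
> skeleton below (the class name and the two-partition scheme are just my own 
> made-up illustration), but I am not sure how it should be used in a shuffle:
> 
> import org.apache.spark.Partitioner
> 
> // Toy partitioner: keys starting with a letter from a to m go to partition 0, 
> // everything else goes to partition 1.
> class FirstLetterPartitioner extends Partitioner {
>   override def numPartitions: Int = 2
>   override def getPartition(key: Any): Int = {
>     val c = key.toString.headOption.map(_.toLower).getOrElse('z')
>     if (c >= 'a' && c <= 'm') 0 else 1
>   }
> }
> 
> // Presumably it would then be passed to partitionBy on a key-value RDD, e.g. 
> // somePairRdd.partitionBy(new FirstLetterPartitioner), but I am not sure.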
> 
> 
> Best,
> Tao
> 
> 
