Also, thanks Matei.

2014-02-26 15:19 GMT+08:00 Matei Zaharia <matei.zaha...@gmail.com>:

> Take a look at the "advanced Spark features" talk here too:
> http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/
>
> Matei
>
> On Feb 25, 2014, at 6:22 PM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
> Thank you Mayur, I think that will help me a lot.
>
> Best,
> Tao
>
> 2014-02-26 8:56 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>
>> The types of shuffling are best explained by Matei in the Spark Internals talk:
>> http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
>> Why don't you look at that first and then ask here if you have follow-up
>> questions? It would also be good to watch the whole talk, as it covers Spark
>> job flows in a lot more detail.
>>
>> Scala (in the Spark shell):
>> import org.apache.spark.RangePartitioner
>> val file = sc.textFile("<my local path>")
>> // Build (key, value) pairs; partitionBy only works on pair RDDs.
>> val partitionedFile = file.map(x => (x, 1))
>> // Range-partition the pairs into 3 partitions.
>> val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
>> // Number of records in each of the 3 partitions.
>> data.glom().collect()(0).length
>> data.glom().collect()(1).length
>> data.glom().collect()(2).length
>>
>> This samples the RDD partitionedFile and then tries to split it into
>> partitions of roughly equal size.
>> Do not call collect() if your data is huge, as this may cause an
>> out-of-memory error on the driver; write the data to disk in that case.
>>
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>> On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>>
>>> I am a newbie to Spark and I need to know how RDD partitioning can be
>>> controlled during shuffling. I have googled for examples but haven't
>>> found many concrete ones, in contrast with the many good tutorials
>>> about Hadoop's shuffling and partitioner.
>>>
>>> Can anybody point me to good tutorials explaining the process of shuffling
>>> in Spark, as well as examples of how to use a customized partitioner?
>>>
>>> Best,
>>> Tao
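
For the customized-partitioner part of the original question, a minimal sketch might look like the following. It is not taken from the thread or from either talk: the class name FirstLetterPartitioner, the sample words, and the local[2] master are all made up for illustration. It assumes Spark is on the classpath and relies only on the org.apache.spark.Partitioner contract (numPartitions and getPartition), plus equals/hashCode so Spark can tell when two RDDs share the same partitioning.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on Spark versions before 1.3

// Hypothetical partitioner: routes String keys by their first character.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  // Must return a value in [0, numPartitions). A Char converts to a
  // non-negative Int, so the modulo result is never negative.
  override def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty => s.charAt(0).toInt % numPartitions
    case _                       => 0 // empty or non-String keys go to partition 0
  }

  // Two partitioners of this class with the same partition count are equivalent.
  override def equals(other: Any): Boolean = other match {
    case p: FirstLetterPartitioner => p.numPartitions == numPartitions
    case _                         => false
  }

  override def hashCode: Int = numPartitions
}

object CustomPartitionerExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads, purely for trying this out.
    val conf = new SparkConf().setAppName("custom-partitioner-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = Seq("apple", "avocado", "banana", "cherry", "cranberry")
      val pairs = sc.parallelize(words).map(w => (w, 1))

      // partitionBy triggers a shuffle that places each record
      // according to FirstLetterPartitioner.getPartition.
      val partitioned = pairs.partitionBy(new FirstLetterPartitioner(3))

      // Count records per partition on the executors, so only one small
      // (index, count) pair per partition is sent back to the driver.
      val sizes = partitioned
        .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
        .collect()

      sizes.foreach { case (idx, n) => println(s"partition $idx: $n records") }
    } finally {
      sc.stop()
    }
  }
}
```

Counting inside mapPartitionsWithIndex rather than using glom().collect() is one way to follow the advice above about not pulling a large data set onto the driver.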