Thanks to you as well, Matei.

2014-02-26 15:19 GMT+08:00 Matei Zaharia <matei.zaha...@gmail.com>:

> Take a look at the "advanced Spark features" talk here too:
> http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/.
>
> Matei
>
> On Feb 25, 2014, at 6:22 PM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
> Thank you Mayur, I think that will help me a lot
>
>
> Best,
> Tao
>
>
> 2014-02-26 8:56 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>
>> The types of shuffling are best explained by Matei in his Spark Internals
>> talk: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
>> Why don't you watch that first and then ask follow-up questions here? It
>> would also be worth watching the whole talk, as it covers Spark job flows
>> in much more detail.
>>
>> Scala:
>> import org.apache.spark.RangePartitioner
>> val file = sc.textFile("<my local path>")
>> val partitionedFile = file.map(x => (x, 1))
>> val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
>> // collect once, then inspect each partition's size
>> val partitionSizes = data.glom().collect().map(_.length)
>> This samples the RDD partitionedFile and then tries to split it into
>> partitions of roughly equal size.
>> Do not call collect() if your data is large, as it may OOM the driver;
>> write the RDD to disk in that case.
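Since Tao also asked about custom partitioners: the contract is just two methods, numPartitions and getPartition(key). The sketch below (FirstLetterPartitioner is a made-up name, not a Spark class) mirrors the org.apache.spark.Partitioner contract but is written as a plain standalone class so it runs without a Spark dependency; in a real job you would extend org.apache.spark.Partitioner and pass an instance to rdd.partitionBy(...).

```java
// Standalone sketch of Spark's Partitioner contract (numPartitions +
// getPartition). In a real Spark job this class would instead extend
// org.apache.spark.Partitioner; the logic stays the same.
public class FirstLetterPartitioner {
    private final int numPartitions;

    public FirstLetterPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int numPartitions() {
        return numPartitions;
    }

    // Route each string key by its first character, so keys sharing an
    // initial letter always land in the same partition.
    public int getPartition(Object key) {
        if (key instanceof String && !((String) key).isEmpty()) {
            return Math.abs(((String) key).charAt(0)) % numPartitions;
        }
        return 0; // fallback bucket for null/empty/non-string keys
    }

    public static void main(String[] args) {
        FirstLetterPartitioner p = new FirstLetterPartitioner(3);
        for (String k : new String[]{"apple", "avocado", "banana", "cherry"}) {
            System.out.println(k + " -> partition " + p.getPartition(k));
        }
    }
}
```

In Spark itself this would be used as partitionedFile.partitionBy(new FirstLetterPartitioner(3)). Note that a real Partitioner should also override equals/hashCode, since Spark compares partitioners to decide whether two RDDs are co-partitioned and a shuffle can be skipped.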
>>
>>
>>
>>
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>>
>>
>> On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>>
>>> I am new to Spark and need to know how RDD partitioning can be
>>> controlled during shuffling. I have googled for examples but haven't
>>> found many concrete ones, in contrast to the many good tutorials about
>>> Hadoop's shuffling and partitioner.
>>>
>>> Can anybody point me to good tutorials explaining the shuffle process
>>> in Spark, as well as examples of how to use a custom partitioner?
>>>
>>>
>>> Best,
>>> Tao
>>>
>>
>>
>
>
