The easiest option is to plug in your own partitioner if you know the nature of the data. If you don't, you can sample the data to compute partition weights; RangePartitioner does this out of the box.
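
For the Python API, here is a minimal sketch of both ideas. The names and the sample fraction are illustrative, and it assumes a PySpark release in which partitionBy(numPartitions, partitionFunc) applies partitionFunc(key) % numPartitions to each key individually (the batched serialization in 0.8.0's Python API may account for the output quoted below). RangePartitioner itself is a Scala class and is not exposed in Python, so the second half approximates it with boundaries taken from a sample of the keys:

    from pyspark import SparkContext

    sc = SparkContext("local", "partition-demo")

    pairs = [(i, i * i) for i in range(16)] * 10
    rdd = sc.parallelize(pairs, 80)

    # 1) Custom partitioner: with an identity partition function, key k
    #    lands in partition k % 16, so each of the 16 keys gets its own
    #    partition.
    by_key = rdd.partitionBy(16, partitionFunc=lambda k: k)
    for idx, part in enumerate(by_key.glom().collect()):
        print(idx, "**", part)

    # 2) Range-style partitioning: sample the keys to estimate split
    #    boundaries, in the spirit of Scala's RangePartitioner.
    num_parts = 4
    sample = sorted(rdd.map(lambda kv: kv[0]).sample(False, 0.2).collect())
    bounds = [sample[len(sample) * (i + 1) // num_parts - 1]
              for i in range(num_parts - 1)]

    def range_partition(key):
        # The partition index is the number of boundaries the key exceeds.
        return sum(1 for b in bounds if key > b)

    by_range = rdd.partitionBy(num_parts, range_partition)

In later releases, sortByKey gives a similar range-based layout out of the box, since it samples the keys to pick its partition boundaries.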
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi

On Mon, Feb 24, 2014 at 6:16 PM, zhaoxw12 <zhaox...@mails.tsinghua.edu.cn> wrote:

> I use spark-0.8.0. This is my code in Python:
>
> list = [(i, i*i) for i in xrange(0, 16)]*10
> rdd = sc.parallelize(list, 80)
> temp = rdd.collect()
> temp2 = rdd.partitionBy(16, lambda x: x)
> count = 0
> for i in temp2.glom().collect():
>     print count, "**", i
>     count += 1
>
> This gives the result below:
>
> 0 ** [(10, 100), (1, 1), (10, 100)... (10, 100), (1, 1), (10, 100)]
> 1 ** [(2, 4), (11, 121), (11, 121)... (2, 4), (11, 121), (2, 4), (11, 121)]
> 2 ** [(12, 144), (12, 144), (3, 9)... (12, 144), (12, 144), (3, 9), (12, 144)]
> 3 ** [(13, 169), (13, 169), (4, 16)... (4, 16), (13, 169), (13, 169), (13, 169)]
> 4 ** [(14, 196), (5, 25), (5, 25)... (14, 196), (5, 25), (14, 196), (5, 25), (14, 196)]
> 5 ** [(6, 36), (6, 36), (15, 225)... (6, 36), (6, 36), (15, 225), (15, 225), (6, 36)]
> 6 ** [(7, 49), (7, 49), (7, 49)... (7, 49), (7, 49), (7, 49), (7, 49), (7, 49), (7, 49)]
> 7 ** [(8, 64), (8, 64), (8, 64)... (8, 64), (8, 64), (8, 64), (8, 64), (8, 64), (8, 64)]
> 8 ** [(9, 81), (9, 81), (9, 81)... (9, 81), (9, 81), (9, 81), (9, 81), (9, 81), (9, 81)]
> 9 ** []
> 10 ** []
> 11 ** []
> 12 ** []
> 13 ** []
> 14 ** []
> 15 ** [(0, 0), (0, 0), (0, 0), (0, 0)... (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
>
> I want each partition to contain only one key. As you can see, there are
> 6 partitions that are empty and 6 partitions that hold 2 keys. This causes
> me a lot of trouble when handling big data.
> I face the same problem as
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
> but I have not found a solution in Python.
> Thanks a lot for your help. :)
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-well-distribute-partition-tp2002.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.