The easiest option is to plug in your own partitioner if you know the nature of the data. If you don't, you can sample the data to compute partition weights; RangePartitioner does this out of the box.
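
For the Python API, here is a minimal sketch of both ideas. The names and the sample fraction are illustrative, and it assumes a PySpark release in which partitionBy(numPartitions, partitionFunc) applies partitionFunc(key) % numPartitions to each key individually (the batched serialization in 0.8.0's Python API may account for the output quoted below). RangePartitioner itself is a Scala class and is not exposed in Python, so the second half approximates it with boundaries taken from a sample of the keys:

    from pyspark import SparkContext

    sc = SparkContext("local", "partition-demo")

    pairs = [(i, i * i) for i in range(16)] * 10
    rdd = sc.parallelize(pairs, 80)

    # 1) Custom partitioner: with an identity partition function, key k
    #    lands in partition k % 16, so each of the 16 keys gets its own
    #    partition.
    by_key = rdd.partitionBy(16, partitionFunc=lambda k: k)
    for idx, part in enumerate(by_key.glom().collect()):
        print(idx, "**", part)

    # 2) Range-style partitioning: sample the keys to estimate split
    #    boundaries, in the spirit of Scala's RangePartitioner.
    num_parts = 4
    sample = sorted(rdd.map(lambda kv: kv[0]).sample(False, 0.2).collect())
    bounds = [sample[len(sample) * (i + 1) // num_parts - 1]
              for i in range(num_parts - 1)]

    def range_partition(key):
        # The partition index is the number of boundaries the key exceeds.
        return sum(1 for b in bounds if key > b)

    by_range = rdd.partitionBy(num_parts, range_partition)

In later releases, sortByKey gives a similar range-based layout out of the box, since it samples the keys to pick its partition boundaries.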
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi

On Mon, Feb 24, 2014 at 6:16 PM, zhaoxw12 <zhaox...@mails.tsinghua.edu.cn> wrote:

> I use spark-0.8.0. This is my code in Python:
>
> list = [(i, i*i) for i in xrange(0, 16)]*10
> rdd = sc.parallelize(list, 80)
> temp = rdd.collect()
> temp2 = rdd.partitionBy(16, lambda x: x)
> count = 0
> for i in temp2.glom().collect():
>     print count, "**", i
>     count += 1
>
> This gives the result below:
>
> 0 ** [(10, 100), (1, 1), (10, 100)... (10, 100), (1, 1), (10, 100)]
> 1 ** [(2, 4), (11, 121), (11, 121)... (2, 4), (11, 121), (2, 4), (11, 121)]
> 2 ** [(12, 144), (12, 144), (3, 9)... (12, 144), (12, 144), (3, 9), (12, 144)]
> 3 ** [(13, 169), (13, 169), (4, 16)... (4, 16), (13, 169), (13, 169), (13, 169)]
> 4 ** [(14, 196), (5, 25), (5, 25)... (14, 196), (5, 25), (14, 196), (5, 25), (14, 196)]
> 5 ** [(6, 36), (6, 36), (15, 225)... (6, 36), (6, 36), (15, 225), (15, 225), (6, 36)]
> 6 ** [(7, 49), (7, 49), (7, 49)... (7, 49), (7, 49), (7, 49), (7, 49), (7, 49), (7, 49)]
> 7 ** [(8, 64), (8, 64), (8, 64)... (8, 64), (8, 64), (8, 64), (8, 64), (8, 64), (8, 64)]
> 8 ** [(9, 81), (9, 81), (9, 81)... (9, 81), (9, 81), (9, 81), (9, 81), (9, 81), (9, 81)]
> 9 ** []
> 10 ** []
> 11 ** []
> 12 ** []
> 13 ** []
> 14 ** []
> 15 ** [(0, 0), (0, 0), (0, 0), (0, 0)... (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
>
> I want each partition to contain only one key. As you can see, there are
> 6 partitions that are empty and 6 partitions that hold 2 keys. This causes
> me a lot of trouble when handling big data.
> I face the same problem as
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
> but I have not found a solution in Python.
> Thanks a lot for your help. :)
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-well-distribute-partition-tp2002.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.