I imagine this is a simplified example meant to illustrate a bigger concern. In general, when you sort by key, Spark shuffles the data by key, and all records with the same key land in the same partition. Since one key (0) has 19,998 records and the other (1) has just one, the shuffle simply produces two very skewed partitions.

One way you can solve this is by salting. For example, you might do something like:

    val salted = sc.parallelize(2 to n)
      .map(x => ((x / n) * 100 + scala.util.Random.nextInt(100), x))
      .sortByKey()
      .map(x => (x._1 / 100, x._2))

Basically, you create a temporary key that has randomness in it but preserves the original key's order, sort on that, and then map back to the original key.
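To make the salting arithmetic concrete, here is a minimal sketch in plain Scala, deliberately without a Spark dependency so it can be run locally. The names `SaltedKeys`, `salt`, `unsalt`, and `saltedSort` are hypothetical helpers made up for illustration, not Spark APIs, and the sketch assumes integer keys small enough that `key * buckets` does not overflow an `Int`.

```scala
import scala.util.Random

// Hypothetical helpers for illustration; none of this is part of the Spark API.
object SaltedKeys {
  // Spread one key over `buckets` consecutive salted keys.
  // Every salted key of k is smaller than every salted key of k + 1,
  // so sorting by the salted key preserves the original key order.
  def salt(key: Int, buckets: Int, rnd: Random): Int =
    key * buckets + rnd.nextInt(buckets)

  // Recover the original key; floorDiv keeps this correct for negative keys.
  def unsalt(salted: Int, buckets: Int): Int =
    Math.floorDiv(salted, buckets)

  // Local stand-in for the RDD pipeline: salt, sort by salted key, unsalt.
  // On an RDD the same three steps would be map, sortByKey, map.
  def saltedSort(pairs: Seq[(Int, Int)], buckets: Int, seed: Long): Seq[(Int, Int)] = {
    val rnd = new Random(seed)
    pairs
      .map { case (k, v) => (salt(k, buckets, rnd), v) }
      .sortBy(_._1)
      .map { case (k, v) => (unsalt(k, buckets), v) }
  }
}
```

The point of the salt is that the sort now sees up to `buckets` distinct keys per original key, so a range-based partitioner has boundaries to split on instead of one huge key. Note that the order of values within one original key becomes random, which is fine when only key order across the output files matters.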
Hope this helps...

From: Zhang, Liyun [mailto:liyun.zh...@intel.com]
Sent: Tuesday, September 06, 2016 11:13 AM
To: user@spark.apache.org
Subject: How to make the result of sortByKey distributed evenly?

Hi all:

I have a question about RDD.sortByKey:

    val n = 20000
    val sorted = sc.parallelize(2 to n).map(x => (x / n, x)).sortByKey()
    sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x => (x / n, x)) generates pairs like [(0,2),(0,3),...,(0,19999),(1,20000)], so the key is skewed.

I expected the result of sortByKey to be distributed evenly, but when I viewed the output I found that part-00000 is large and part-00001 is small:

    hadoop fs -ls /SkewedGroupByTest/
    16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 3 items
    -rw-r--r-- 1 root supergroup      0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
    -rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
    -rw-r--r-- 1 root supergroup     10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get the result distributed evenly? I don't need all the keys within a part-xxxxx file to be the same; I only need the data across part-xxxx0 ~ part-xxxxx to be sorted.

Thanks for any help!

Kelly Zhang/Zhang,Liyun
Best Regards