I imagine this is a simplified example meant to illustrate a bigger problem.
In general, when you sort by key, Spark shuffles the data into range
partitions by key, so all records with the same key land in the same
partition. Since key 0 has 19998 records and key 1 has just 1 record, the
data simply ends up in two very skewed partitions.
One way you can solve this is by using salting. For example, you might do 
something like:

  // n is the value from your example (20000)
  val salted = sc.parallelize(2 to n)
    .map(x => ((x / n) * 100 + scala.util.Random.nextInt(100), x))
    .sortByKey()
    .map(x => (x._1 / 100, x._2))

So basically you create a temporary key that has randomness in it but still
preserves the order of the original keys, and after the sort you map it back
to the original key.
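
If it helps to see the key arithmetic on its own, here is a minimal sketch in
plain Scala (no Spark), assuming non-negative Int keys. The names
SaltingSketch, SaltRange, salt and unsalt are made up for illustration and
are not part of any Spark API. It only shows that salting spreads one hot key
over many distinct keys while the restored keys stay in order; it does not
reproduce Spark's partitioning.

```scala
import scala.util.Random

object SaltingSketch {
  // Assumption for illustration: 100 salt values per original key.
  val SaltRange = 100

  // Put the original key in the high part and a random salt in the low part.
  def salt(key: Int, rng: Random): Int = key * SaltRange + rng.nextInt(SaltRange)

  // Integer division drops the salt again (valid for non-negative keys).
  def unsalt(saltedKey: Int): Int = saltedKey / SaltRange

  def main(args: Array[String]): Unit = {
    val n = 20000
    val rng = new Random(42) // fixed seed only to make the sketch repeatable
    val pairs = (2 to n).map(x => (x / n, x))

    // Salt, sort by the salted key, then restore the original key.
    val sorted = pairs.map { case (k, v) => (salt(k, rng), v) }.sortBy(_._1)
    val restored = sorted.map { case (k, v) => (unsalt(k), v) }

    println(s"distinct keys before salting: ${pairs.map(_._1).distinct.size}")
    println(s"distinct keys after salting:  ${sorted.map(_._1).distinct.size}")

    // The restored keys are still in non-decreasing order.
    val keys = restored.map(_._1)
    println(s"still sorted by original key: ${keys == keys.sorted}")
  }
}
```

The factor (100 here) is a trade-off: it has to be larger than any salt value
so that dividing by it recovers the original key, and a larger factor gives
the sort more distinct keys to split on.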

Hope this helps...



From: Zhang, Liyun [mailto:liyun.zh...@intel.com]
Sent: Tuesday, September 06, 2016 11:13 AM
To: user@spark.apache.org
Subject: How to make the result of sortByKey distributed evenly?

Hi all:
  I have a question about RDD.sortByKey

val n=20000
val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
 sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like 
[(0,2),(0,3),.....,(0,19999),(1,20000)], the key is skewed.

I expected the result of sortByKey to be distributed evenly, but when I
looked at the output I found that part-00000 is large and part-00001 is small.

hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 root supergroup 0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r-- 1 root supergroup 10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get the result distributed evenly? I don't need all records with
the same key to be in the same part-xxxxx file; I only need to guarantee that
the data across part-xxxx0 ~ part-xxxxx is sorted.


Thanks for any help!


Kelly Zhang/Zhang,Liyun
Best Regards





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RE-How-to-make-the-result-of-sortByKey-distributed-evenly-tp27662.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
