Hello everyone,

I have a question regarding the sort shuffle. Roughly, I'm doing something
like:

rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
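
In case it helps, here is a minimal, self-contained version of the job
(Spark 1.3 Java API). The bodies of f1 and f2, and the String/Integer
types, are just placeholders for my real code:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

public class ShuffleSortQuestion {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("shuffle-sort-question").setMaster("local[2]"));

    JavaRDD<String> rdd = sc.parallelize(Arrays.asList("b", "a", "c", "a"));

    // f1: emit (key, 1) pairs, one pass per partition.
    PairFlatMapFunction<Iterator<String>, String, Integer> f1 =
        new PairFlatMapFunction<Iterator<String>, String, Integer>() {
          @Override
          public Iterable<Tuple2<String, Integer>> call(Iterator<String> it) {
            List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
            while (it.hasNext()) {
              out.add(new Tuple2<String, Integer>(it.next(), 1));
            }
            return out;
          }
        };

    // f2: this is where I expected the keys to arrive sorted, but they don't.
    PairFlatMapFunction<Iterator<Tuple2<String, Iterable<Integer>>>, String, Integer> f2 =
        new PairFlatMapFunction<Iterator<Tuple2<String, Iterable<Integer>>>, String, Integer>() {
          @Override
          public Iterable<Tuple2<String, Integer>> call(
              Iterator<Tuple2<String, Iterable<Integer>>> it) {
            List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
            while (it.hasNext()) {
              Tuple2<String, Iterable<Integer>> grouped = it.next();
              System.out.println("key seen by f2: " + grouped._1()); // not sorted
              out.add(new Tuple2<String, Integer>(grouped._1(), 0));
            }
            return out;
          }
        };

    rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2).collect();
    sc.stop();
  }
}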

The problem is that in f2 I don't see the keys arriving sorted. My keys
implement java.lang.Comparable, not scala.math.Ordered, and there is no
scala.math.Ordering for them (it would be awkward for each key class to
carry its own Ordering, as discussed in SPARK-2045:
https://issues.apache.org/jira/browse/SPARK-2045)

Questions:
1. Do I need to call sortByKey() explicitly? (If I do, the keys arrive
correctly sorted in f2; see the sketch after this list.) I'm worried about
the extra cost, since Spark 1.3.0 is supposed to use the SORT shuffle
manager by default, right?
2. Does each key need to be a scala.math.Ordered? Is java.lang.Comparable
used at all?
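
To make question 1 concrete, below is the workaround, with a hypothetical
key class MyKey standing in for my real keys. As far as I can tell, the
Java API's no-arg sortByKey() relies on the key class implementing
java.lang.Comparable at runtime:

import java.io.Serializable;

// Hypothetical stand-in for my real key class: a plain java.lang.Comparable,
// with equals/hashCode so that groupByKey groups correctly.
public class MyKey implements Serializable, Comparable<MyKey> {
  private final long id;

  public MyKey(long id) { this.id = id; }

  @Override
  public int compareTo(MyKey other) {
    return Long.compare(this.id, other.id); // natural order used by sortByKey()
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof MyKey && ((MyKey) o).id == this.id;
  }

  @Override
  public int hashCode() {
    return (int) (id ^ (id >>> 32));
  }
}

// The workaround itself, inserted between the group and f2:
// pairs.groupByKey().sortByKey().mapPartitionsToPair(f2); // keys now reach f2 sorted

This works, but sortByKey() range-partitions the data, so it adds a second
shuffle on top of the groupByKey one, which is exactly the extra cost I'd
like to avoid.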

... btw I'm using Spark from Java ... don't ask me why :)



Best,
Marius
