Hello everyone, I have a question regarding the sort shuffle. Roughly, I'm doing something like:
    rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)

(a runnable toy version is in the P.S. at the end)

The problem is that in f2 I don't see the keys being sorted. The keys are Java Comparable, not scala.math.Ordered, and they don't have a scala.math.Ordering (it would be weird for each key class to implement Ordering, as discussed in the JIRA item https://issues.apache.org/jira/browse/SPARK-2045).

Questions:

1. Do I need to call sortByKey explicitly? (If I do, I can see the keys correctly sorted in f2.) I'm worried about the extra cost, though, since Spark 1.3.0 is supposed to use the SORT shuffle manager by default, right?

2. Does each key need to be a scala.math.Ordered? Is Java Comparable used at all?

By the way, I'm using Spark from Java ... don't ask me why :)

Best,
Marius
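P.S. In case the one-liner is too terse, here is a self-contained toy version of what I mean. String keys and trivial f1/f2 stand in for my real key class and functions (which do more work), and the local[2] master is just so it runs standalone:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SortShuffleQuestion {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("sort-shuffle-question")
        .setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    JavaRDD<String> lines = jsc.parallelize(
        Arrays.asList("b 1", "a 2", "c 3", "a 4", "b 5"), 2);

    // f1: emit (key, value) pairs from each input partition.
    // String keys stand in for my real java.lang.Comparable key class.
    JavaPairRDD<String, String> pairs = lines.mapPartitionsToPair(it -> {
      List<Tuple2<String, String>> out = new ArrayList<>();
      while (it.hasNext()) {
        String[] kv = it.next().split(" ");
        out.add(new Tuple2<>(kv[0], kv[1]));
      }
      return out; // in Spark 1.x, PairFlatMapFunction returns an Iterable
    });

    // After groupByKey alone, f2 sees the keys in no particular order.
    JavaPairRDD<String, Iterable<String>> grouped = pairs.groupByKey();

    // Workaround from question 1: an explicit sortByKey. With no arguments
    // it seems to use the keys' natural Comparable ordering, but it is a
    // separate step (and shuffle), which is the extra cost I'm worried about.
    JavaPairRDD<String, Iterable<String>> sorted = grouped.sortByKey();

    // f2: with the sortByKey above, keys now arrive sorted per partition.
    JavaPairRDD<String, Integer> counts = sorted.mapPartitionsToPair(it -> {
      List<Tuple2<String, Integer>> out = new ArrayList<>();
      while (it.hasNext()) {
        Tuple2<String, Iterable<String>> group = it.next();
        int n = 0;
        for (String ignored : group._2()) {
          n++;
        }
        out.add(new Tuple2<>(group._1(), n));
      }
      return out;
    });

    for (Tuple2<String, Integer> t : counts.collect()) {
      System.out.println(t);
    }

    jsc.stop();
  }
}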