On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu <marius.dan...@gmail.com> wrote:
> Hello anyone,
>
> I have a question regarding the sort shuffle. Roughly, I'm doing something
> like:
>
> rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
>
> The problem is that in f2 I don't see the keys being sorted. The keys are
> Java Comparable, not scala.math.Ordered or scala.math.Ordering (it would be
> weird for each key to implement Ordering, as mentioned in the JIRA item
> https://issues.apache.org/jira/browse/SPARK-2045).
>
> Questions:
> 1. Do I need to explicitly sortByKey? (If I do this, I can see the keys
> correctly sorted in f2.) ... but I'm worried about the extra cost, since
> Spark 1.3.0 is supposed to use the sort shuffle manager by default, right?

AFAIK the sort shuffle does not sort *inside* each partition, unless the
shuffle comes from a sort. Otherwise, the shuffle file contains keys sorted
by their partition IDs, not by key. More details can be found in the design
document attached to SPARK-2045
<https://issues.apache.org/jira/browse/SPARK-2045>. Sorting by partition ID
alone was enough to improve memory consumption.

> 2. Do I need each key to be a scala.math.Ordered? ... Is Java Comparable
> used at all?

There are implicit conversions from Comparable to Ordered, but those only
work in Scala code. Since you're using the Java API, I'm not sure what you
mean here. You can call `JavaPairRDD.sortByKey(comp)` with your own
comparator (see the sketch at the bottom of this message).

cheers,
iulian

> ... btw I'm using Spark from Java ... don't ask me why :)
>
> Best,
> Marius

--
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com
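PS: Since you're on the Java API, here's a minimal, self-contained sketch of
the explicit sort. The class name, toy data, and the String key type are my
own for illustration; only JavaPairRDD.sortByKey(comparator) is the actual
API under discussion.

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortByKeyExample {

  // Spark ships the comparator to the executors, so it must be Serializable.
  static class KeyComparator implements Comparator<String>, Serializable {
    @Override
    public int compare(String a, String b) {
      return a.compareTo(b);
    }
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("sortByKey-example")
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("b", 1), new Tuple2<>("a", 2), new Tuple2<>("c", 3)));

    // groupByKey alone gives no key-ordering guarantee inside a partition;
    // sortByKey adds an explicit range-partitioned sort, i.e. one more
    // shuffle stage on top of the groupByKey shuffle.
    JavaPairRDD<String, Iterable<Integer>> sorted =
        pairs.groupByKey().sortByKey(new KeyComparator());

    System.out.println(sorted.collect());
    sc.stop();
  }
}

If you only need the keys sorted within each partition (not a total order),
repartitionAndSortWithinPartitions(partitioner, comparator) on JavaPairRDD
may be worth a look, since it sorts during a single shuffle instead of
adding a second one.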