A skew join (where the dominant key is spread across multiple executors) is
pretty standard in other frameworks; see for example in scalding:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala

This would be a great addition to Spark, and ideally it belongs in Spark
core, not SQL.

It's also a real big-data problem (a single key is too large for one
executor), which makes it a hard sell in my experience. The interest in
truly big data in the Spark community has been somewhat limited...
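To make the technique concrete: one common way to spread a dominant key across executors is key salting, i.e. replicate the small side of the join once per salt and tag the large side with a random salt, so the hot key lands on several partitions instead of one. This is a minimal sketch on plain Scala collections (standing in for RDD partitions); the function name and shapes are illustrative, not scalding's actual API:

```scala
import scala.util.Random

// Skew join via key salting, modeled on plain collections.
// `left` is the large, skewed side; `right` is the small side whose
// rows we can afford to replicate. `replication` = number of salts.
def saltedJoin[K, V, W](
    left: Seq[(K, V)],
    right: Seq[(K, W)],
    replication: Int
): Seq[(K, (V, W))] = {
  val rng = new Random(42)
  // Each left row gets a random salt, so a hot key is spread across
  // `replication` composite keys (and hence across partitions).
  val saltedLeft = left.map { case (k, v) =>
    ((k, rng.nextInt(replication)), v)
  }
  // Every right row is replicated once per salt so it can meet any
  // salted copy of its key.
  val saltedRight = right.flatMap { case (k, w) =>
    (0 until replication).map(s => ((k, s), w))
  }
  // Ordinary equi-join on the composite (key, salt), then drop the salt.
  val rightIndex = saltedRight.groupBy(_._1)
  saltedLeft.flatMap { case (ks, v) =>
    rightIndex.getOrElse(ks, Nil).map { case ((k, _), w) => (k, (v, w)) }
  }
}
```

Scalding's skew join additionally samples the data first to estimate per-key replication factors; the fixed `replication` above is the simplest version of the idea.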

On Tue, Jun 16, 2015 at 11:28 AM, Jon Walton <jon.w.wal...@gmail.com> wrote:

> On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> 2. Does 1.3.2 or 1.4 have any enhancements that can help?   I tried to
>>> use 1.3.1 but SPARK-6967 prohibits me from doing so.    Now that 1.4 is
>>> available, would any of the JOIN enhancements help this situation?
>>>
>>
>> I would try Spark 1.4 after running "SET
>> spark.sql.planner.sortMergeJoin=true".  Please report back if this works
>> for you.
>>
>
>
> Hi Michael,
>
> This does help.  The joins are faster and fewer executors are lost, but it
> seems the same core problem still exists - that a single executor is
> handling the majority of the join (the skewed key) and bottlenecking the
> job.
>
> One idea I had was to split the dimension table into two halves: a small
> half (with the skewed keys) which can be broadcast, and a large half
> (with even distribution) which could be sort-merge joined, and then
> perform two individual queries against the large fact table and union
> the results.  Does this sound like a worthwhile approach?
>
> Thank you,
>
> Jon
>
>
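The split-dimension idea above can be sketched as follows, again on plain Scala collections as a stand-in for the two Spark join strategies (an in-memory map playing the broadcast join, a second lookup playing the sort-merge join); all names here are illustrative:

```scala
// Split the dimension into a small "hot" half (the skewed keys, cheap
// to broadcast) and a large "cold" half (evenly distributed, suited to
// a sort-merge join), join each half against the fact table, and union.
def splitJoin[K, V, W](
    fact: Seq[(K, V)],
    dim: Seq[(K, W)],
    hotKeys: Set[K]
): Seq[(K, (V, W))] = {
  val (hotDim, coldDim) = dim.partition { case (k, _) => hotKeys(k) }
  // "Broadcast" join: the hot half fits in memory as a map.
  val hotMap = hotDim.toMap
  val hotJoined = fact.collect {
    case (k, v) if hotMap.contains(k) => (k, (v, hotMap(k)))
  }
  // Stand-in for the sort-merge join on the cold half.
  val coldMap = coldDim.toMap
  val coldJoined = fact.collect {
    case (k, v) if coldMap.contains(k) => (k, (v, coldMap(k)))
  }
  hotJoined ++ coldJoined
}
```

In actual Spark SQL the same shape would be two joins unioned together, with the hot half hinted via the `broadcast()` function so the planner does not shuffle it; the cold half falls through to the regular shuffle/sort-merge path.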
