A skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks; see for example scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala
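The usual trick behind such skew joins is key salting: each skewed record on the large side gets a random salt appended to its key, and the matching rows on the small side are replicated under every salt, so one hot key fans out over many executors. A minimal, framework-free sketch of that idea (function and parameter names are hypothetical, not from scalding or Spark):

```python
import random

def salted_join(left, right, skewed_keys, n_salts=4):
    """Skew join via key salting (illustrative sketch only).

    left:  list of (key, value) pairs, possibly skewed
    right: list of (key, value) pairs, small enough per key to replicate
    """
    # Salt the skewed keys on the left: each record picks a random salt,
    # spreading one hot key across n_salts partitions/reducers.
    salted_left = []
    for k, v in left:
        salt = random.randrange(n_salts) if k in skewed_keys else 0
        salted_left.append(((k, salt), v))

    # Replicate the right side under every salt so each partition
    # can complete its share of the join locally.
    salted_right = []
    for k, v in right:
        salts = range(n_salts) if k in skewed_keys else [0]
        for s in salts:
            salted_right.append(((k, s), v))

    # Hash join on the salted key; in a real engine each salted key
    # would land on a different executor.
    table = {}
    for sk, v in salted_right:
        table.setdefault(sk, []).append(v)
    return [(k, (lv, rv))
            for (k, salt), lv in salted_left
            for rv in table.get((k, salt), [])]
```

The replication factor trades extra copies of the small side for even distribution of the hot key, which is why the technique assumes the replicated side is the smaller one.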
This would be a great addition to Spark, and ideally it belongs in Spark core, not SQL. It's also a real big data problem (a single key is too large for one executor), which makes it a hard sell in my experience; the interest in truly big data in the Spark community has been somewhat limited...

On Tue, Jun 16, 2015 at 11:28 AM, Jon Walton <jon.w.wal...@gmail.com> wrote:
> On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>>> 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to
>>> use 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is
>>> available, would any of the JOIN enhancements help this situation?
>>
>> I would try Spark 1.4 after running "SET
>> spark.sql.planner.sortMergeJoin=true". Please report back if this works
>> for you.
>
> Hi Michael,
>
> This does help. The joins are faster and fewer executors are lost, but it
> seems the same core problem still exists: a single executor is
> handling the majority of the join (the skewed key) and bottlenecking the
> job.
>
> One idea I had was to split the dimension table into two halves: a small
> half which can be broadcast (containing the skewed keys), and a large
> half which could be sort-merge joined (with even distribution), then
> perform two individual queries against the large fact table and union
> the results. Does this sound like a worthwhile approach?
>
> Thank you,
>
> Jon
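Jon's split-and-union proposal can be sketched without any framework: partition the dimension table by skew, join the skewed half with a broadcast-style hash join, join the remainder with a sort-merge join, and union the two results. This is an illustrative sketch under the usual assumptions (inner join, dimension keys unique); all names are hypothetical:

```python
def split_union_join(fact, dim, skewed_keys):
    """Join fact (list of (key, value)) against dim (list of (key, value),
    keys unique) by splitting dim on skew and unioning two join plans."""
    # Small half with the skewed keys: cheap to broadcast as a hash table.
    dim_hot = {k: v for k, v in dim if k in skewed_keys}
    # Large, evenly distributed half: sorted for a merge join.
    dim_cold = sorted((k, v) for k, v in dim if k not in skewed_keys)

    # Query 1: broadcast hash join for the skewed keys.
    hot = [(k, (f, dim_hot[k])) for k, f in fact if k in dim_hot]

    # Query 2: sort-merge join for the remainder.
    fact_cold = sorted((k, f) for k, f in fact if k not in skewed_keys)
    cold, i, j = [], 0, 0
    while i < len(fact_cold) and j < len(dim_cold):
        fk, fv = fact_cold[i]
        dk, dv = dim_cold[j]
        if fk < dk:
            i += 1
        elif fk > dk:
            j += 1
        else:
            # Dimension keys are unique, so stay on dim_cold[j] to match
            # any further fact rows with the same key.
            cold.append((fk, (fv, dv)))
            i += 1

    # Union of the two result sets.
    return hot + cold
```

The design point is that the broadcast half never shuffles the skewed fact rows by key at all, so no single executor receives the hot key's entire partition, while the sort-merge half keeps the memory-friendly path for the evenly distributed bulk of the data.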