GitHub user wangxiaojing opened a pull request:
https://github.com/apache/spark/pull/3442
[SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
We add a `BroadcastLeftSemiJoinHash` operator to implement the
broadcast join for `left semijoin`.
In a left semi join:
If the size of the data from the right side is smaller than the user-settable
threshold `AUTO_BROADCASTJOIN_THRESHOLD`,
the planner marks it as the `broadcast` relation and marks the other
relation as the stream side. The broadcast table is broadcast to all of
the executors involved in the join as an `org.apache.spark.broadcast.Broadcast`
object, and the join uses `joins.BroadcastLeftSemiJoinHash`. Otherwise, it uses
`joins.LeftSemiJoinHash`.
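The selection rule and the join semantics described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not the code in this pull request: `Row`, `chooseOperator`, and `leftSemiJoin` are hypothetical names, ordinary collections stand in for Spark relations, and the strict "smaller than" comparison simply follows the wording of the description.

```scala
// Illustrative sketch only: plain Scala collections stand in for Spark relations.
// `Row`, `chooseOperator`, and `leftSemiJoin` are hypothetical names, not Spark APIs.
object LeftSemiJoinSketch {
  final case class Row(key: Int, value: String)

  // Pick the operator by comparing the estimated size of the right side
  // with the broadcast threshold (assumed here to be a strict comparison).
  def chooseOperator(rightSizeInBytes: Long, broadcastThreshold: Long): String =
    if (rightSizeInBytes < broadcastThreshold) "BroadcastLeftSemiJoinHash"
    else "LeftSemiJoinHash"

  // Left semi join: keep each left (stream side) row whose key appears on the
  // right side. Only the set of right-side keys is needed, which is the small
  // structure that a broadcast variant would ship to every executor.
  def leftSemiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightKeys: Set[Int] = right.map(_.key).toSet
    left.filter(row => rightKeys.contains(row.key))
  }
}
```

Because a left semi join only tests for the existence of a match, duplicate keys on the right side never duplicate left rows, and no right-side columns appear in the output.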
The benchmark suggests this makes the optimized version about 4.7x faster for
`left semijoin` (9288 ms vs. 1963 ms):
<pre><code>
Original:
left semi join : 9288 ms
Optimized:
left semi join : 1963 ms
</code></pre>
The micro benchmark loads `/data1/kv3.txt` into a normal Hive table.
Benchmark code:
<pre><code>
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Runs `f` once and returns the elapsed wall-clock time in milliseconds.
def benchmark(f: => Unit): Long = {
  val begin = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  end - begin
}

val sc = new SparkContext(
  new SparkConf()
    .setMaster("local")
    .setAppName(getClass.getSimpleName.stripSuffix("$")))
val hiveContext = new HiveContext(sc)
import hiveContext._

sql("drop table if exists left_table")
sql("drop table if exists right_table")
sql("create table left_table (key int, value string)")
sql("""load data local inpath "/data1/kv3.txt" into table left_table""")
sql("create table right_table (key int, value string)")
// Populate right_table with a copy of left_table.
sql(
  """
    |from left_table
    |insert overwrite table right_table
    |select left_table.key, left_table.value
  """.stripMargin)

val leftSemiJoin = sql(
  """select a.key from left_table a
    |left semi join right_table b on a.key = b.key""".stripMargin)
val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
println(s"left semi join : $leftSemiJoinDuration ms ")
</code></pre>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangxiaojing/spark SPARK-4570
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3442.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3442
----
commit 5d58772aa0bd7fd55a9b9495efbff5cc0b36aeae
Author: wangxiaojing <[email protected]>
Date: 2014-11-25T04:04:05Z
add BroadcastLeftSemiJoinHash
----