That’s right. By default, Spark prefers sort merge join.
However, in our production environment there are many huge bucketed tables. We
can leverage the bucketing to avoid a shuffle when joining them with smaller
tables (tables that are still too large for a broadcast join). The problem is
that, although the shuffle can be avoided, a sort is still necessary for sort
merge join (we cannot pre-sort the tables since there are different join
patterns). For a huge table, the sort alone may take tens of seconds.
That’s why I’m trying to enable shuffle hash join. In such cases, we saw
roughly a 10% to 20% improvement when applying shuffle hash join instead of
sort merge join. I wonder if there is still room to improve shuffle hash join,
for example with code generation for ShuffledHashJoinExec?
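
A minimal sketch of the kind of setup described above, for readers who want to
reproduce it. The table names, row counts, bucket count, and threshold value
are hypothetical and not taken from this thread. It relies on the
spark.sql.join.preferSortMergeJoin and spark.sql.autoBroadcastJoinThreshold
settings. Whether the planner actually picks ShuffledHashJoin also depends on
its size estimates for both sides, so check the printed plan rather than
assuming the result.

```scala
import org.apache.spark.sql.SparkSession

object ShuffledHashJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffled-hash-join-sketch")
      .master("local[*]")
      // Allow the planner to consider shuffle hash join instead of
      // always preferring sort merge join.
      .config("spark.sql.join.preferSortMergeJoin", "false")
      // Illustrative value (1 MB): low enough that the smaller table
      // below should not qualify for a broadcast join.
      .config("spark.sql.autoBroadcastJoinThreshold", (1L * 1024 * 1024).toString)
      .getOrCreate()

    // Hypothetical "huge" table, bucketed on the join key.
    spark.range(0L, 5000000L).withColumnRenamed("id", "key")
      .write.mode("overwrite").bucketBy(8, "key").saveAsTable("huge_bucketed")

    // Hypothetical smaller table, assumed too large to broadcast.
    spark.range(0L, 500000L).withColumnRenamed("id", "key")
      .write.mode("overwrite").saveAsTable("medium_table")

    val joined = spark.table("huge_bucketed")
      .join(spark.table("medium_table"), "key")

    // Inspect the physical plan to see which join operator was chosen
    // and whether an Exchange (shuffle) or Sort appears on each side.
    joined.explain()

    spark.stop()
  }
}
```

On Spark 3.0 and later, a SHUFFLE_HASH join hint is another way to request
this join strategy for a specific query.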

From: Wenchen Fan <cloud0...@gmail.com>
Date: Sunday, November 10, 2019 at 5:57 PM
To: "Wang, Gang" <gwa...@ebay.com.invalid>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

By default, sort merge join is preferred over shuffle hash join; that's why we 
haven't spent resources on implementing codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gwa...@ebay.com.invalid> wrote:

There are some cases where shuffle hash join performs even better than sort 
merge join.

However, I noticed that ShuffledHashJoinExec does not implement CodegenSupport. 
Is there any concern behind that? And is there any chance to improve the 
performance of ShuffledHashJoinExec?
