Hi Sungwoo, I have totally no idea why we changed the default value. I'm just sharing my knowledge and experience.
First, I know there is a known issue when we use it with Tez. We can see the following statement on the official website <https://cwiki.apache.org/confluence/display/hive/configuration+properties>. > For multiple joins on the same condition, merge joins together into a single join operator. This is useful in the case of large shuffle joins to avoid a reshuffle phase. Disabling this in Tez will often provide a faster join algorithm in case of left outer joins or a general Snowflake schema. Honestly, I don't know the detail. But I have had one negative experience so far. While I was using Hive 2 with `hive.merge.nway.joins=true`, Merge Join was applied even though one or two tables are small enough. The performance degraded because the largest table has a skew on the join key. If I remember correctly, `hive.merge.nway.joins` merges multiple joins in an early stage, and some optimization can miss a chance. Of course, I know it can also positively work in some cases. Note that the version I used is a bit old, my memory could be wrong, and again I am not sure about the concrete background of HIVE-21189. Thanks, Okumin On Thu, May 25, 2023 at 7:48 PM Sungwoo Park <glap...@gmail.com> wrote: > Hello, > > In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to > false. There is no record of why it was set to false, and I would like to > understand the background for the decision. Specifically I wonder if the > following situation is relevant to the decision. > > Example) > MapJoinOp_1 joins: table G, table A, table B, table C > MapJoinOp_2 joins: table G, table A, table B , table D > > Here, table G is a big table to be read via shuffling. > MayJoinOp_1 needs table C, while MapJoinOp_2 needs table D. > SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and > MapJoinOp_2 (because of table G and table A), so that both operators can > share in-memory tables. > > Assume that MapJoinOp_1 is executed first and fills the cache first. Then, > MapJoinOp_2 does not load the cache which is already filled. As a result, > it ends up with something like NullPointerException. > > After setting hive.merge.nway.joins to true, I encountered a problem (which > is not easy to reproduce), and I wonder if the above scenario is feasible > in the current implementation. > > Many thanks, > > --- Sungwoo > > > > > > > [1] https://issues.apache.org/jira/browse/HIVE-21189 >