Hi Okumin, I agree that setting hive.merge.nway.joins to true (or false) can have both positive and negative effects on the performance, depending on the query and other related configuration parameters. I thought the default value was set to false for some correctness issue (for example, mapjoin_filter_on_outerjoin.q fails).
While investigating this issue, I found a bug and created a new JIRA (with pull-request created by Seonggon). This bug is not easy to reproduce, but it does occur (causing NPE) when hive.merge.nway.joins is set to true. https://issues.apache.org/jira/browse/HIVE-27375 Thanks, --- Sungwoo On Sat, May 27, 2023 at 12:40 AM おくみん <m...@okumin.com> wrote: > Hi Sungwoo, > > I have totally no idea why we changed the default value. I'm just sharing > my knowledge and experience. > > First, I know there is a known issue when we use it with Tez. We can see > the following statement on the official website > <https://cwiki.apache.org/confluence/display/hive/configuration+properties > >. > > > For multiple joins on the same condition, merge joins together into a > single join operator. This is useful in the case of large shuffle joins to > avoid a reshuffle phase. Disabling this in Tez will often provide a faster > join algorithm in case of left outer joins or a general Snowflake schema. > > Honestly, I don't know the detail. But I have had one negative experience > so far. While I was using Hive 2 with `hive.merge.nway.joins=true`, Merge > Join was applied even though one or two tables are small enough. The > performance degraded because the largest table has a skew on the join key. > If I remember correctly, `hive.merge.nway.joins` merges multiple joins in > an early stage, and some optimization can miss a chance. Of course, I know > it can also positively work in some cases. > > Note that the version I used is a bit old, my memory could be wrong, and > again I am not sure about the concrete background of HIVE-21189. > > Thanks, > Okumin > > > On Thu, May 25, 2023 at 7:48 PM Sungwoo Park <glap...@gmail.com> wrote: > > > Hello, > > > > In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to > > false. There is no record of why it was set to false, and I would like to > > understand the background for the decision. Specifically I wonder if the > > following situation is relevant to the decision. > > > > Example) > > MapJoinOp_1 joins: table G, table A, table B, table C > > MapJoinOp_2 joins: table G, table A, table B , table D > > > > Here, table G is a big table to be read via shuffling. > > MayJoinOp_1 needs table C, while MapJoinOp_2 needs table D. > > SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and > > MapJoinOp_2 (because of table G and table A), so that both operators can > > share in-memory tables. > > > > Assume that MapJoinOp_1 is executed first and fills the cache first. > Then, > > MapJoinOp_2 does not load the cache which is already filled. As a result, > > it ends up with something like NullPointerException. > > > > After setting hive.merge.nway.joins to true, I encountered a problem > (which > > is not easy to reproduce), and I wonder if the above scenario is feasible > > in the current implementation. > > > > Many thanks, > > > > --- Sungwoo > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/HIVE-21189 > > >