Hi Okumin,

I agree that setting hive.merge.nway.joins to true (or false) can have both
positive and negative effects on the performance, depending on the query
and other related configuration parameters. I thought the default value was
set to false for some correctness issue (for
example, mapjoin_filter_on_outerjoin.q fails).

While investigating this issue, I found a bug and created a new JIRA (with
pull-request created by Seonggon). This bug is not easy to reproduce, but
it does occur (causing NPE) when hive.merge.nway.joins is set to true.

https://issues.apache.org/jira/browse/HIVE-27375

Thanks,

--- Sungwoo


On Sat, May 27, 2023 at 12:40 AM おくみん <m...@okumin.com> wrote:

> Hi Sungwoo,
>
> I have totally no idea why we changed the default value. I'm just sharing
> my knowledge and experience.
>
> First, I know there is a known issue when we use it with Tez. We can see
> the following statement on the official website
> <https://cwiki.apache.org/confluence/display/hive/configuration+properties
> >.
>
> > For multiple joins on the same condition, merge joins together into a
> single join operator. This is useful in the case of large shuffle joins to
> avoid a reshuffle phase. Disabling this in Tez will often provide a faster
> join algorithm in case of left outer joins or a general Snowflake schema.
>
> Honestly, I don't know the detail. But I have had one negative experience
> so far. While I was using Hive 2 with `hive.merge.nway.joins=true`, Merge
> Join was applied even though one or two tables are small enough. The
> performance degraded because the largest table has a skew on the join key.
> If I remember correctly, `hive.merge.nway.joins` merges multiple joins in
> an early stage, and some optimization can miss a chance. Of course, I know
> it can also positively work in some cases.
>
> Note that the version I used is a bit old, my memory could be wrong, and
> again I am not sure about the concrete background of HIVE-21189.
>
> Thanks,
> Okumin
>
>
> On Thu, May 25, 2023 at 7:48 PM Sungwoo Park <glap...@gmail.com> wrote:
>
> > Hello,
> >
> > In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to
> > false. There is no record of why it was set to false, and I would like to
> > understand the background for the decision. Specifically I wonder if the
> > following situation is relevant to the decision.
> >
> > Example)
> > MapJoinOp_1 joins: table G, table A, table B, table C
> > MapJoinOp_2 joins: table G, table A, table B              , table D
> >
> > Here, table G is a big table to be read via shuffling.
> > MayJoinOp_1 needs table C, while MapJoinOp_2 needs table D.
> > SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and
> > MapJoinOp_2 (because of table G and table A), so that both operators can
> > share in-memory tables.
> >
> > Assume that MapJoinOp_1 is executed first and fills the cache first.
> Then,
> > MapJoinOp_2 does not load the cache which is already filled. As a result,
> > it ends up with something like NullPointerException.
> >
> > After setting hive.merge.nway.joins to true, I encountered a problem
> (which
> > is not easy to reproduce), and I wonder if the above scenario is feasible
> > in the current implementation.
> >
> > Many thanks,
> >
> > --- Sungwoo
> >
> >
> >
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/HIVE-21189
> >
>

Reply via email to