Hi Sungwoo,

I have no idea why the default value was changed, so I'm just sharing
my knowledge and experience.

First, there is a known issue when `hive.merge.nway.joins` is used with
Tez. The official documentation says the following
<https://cwiki.apache.org/confluence/display/hive/configuration+properties>:

> For multiple joins on the same condition, merge joins together into a
> single join operator. This is useful in the case of large shuffle joins to
> avoid a reshuffle phase. Disabling this in Tez will often provide a faster
> join algorithm in case of left outer joins or a general Snowflake schema.

Honestly, I don't know the details, but I have had one negative experience
so far. While I was using Hive 2 with `hive.merge.nway.joins=true`, Merge
Join was applied even though one or two of the tables were small enough for
a map join. Performance degraded because the largest table was skewed on
the join key. If I remember correctly, `hive.merge.nway.joins` merges
multiple joins at an early stage, so some later optimizations can miss
their chance to apply. Of course, I know it can also help in some cases.
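
As a toy illustration of the skew problem (plain Python, not Hive code; the
function name, key values, and reducer count are all made up for this
sketch): a shuffle-based join sends every row with the same join key to the
same reducer, so a single hot key overloads one task.

```python
from collections import Counter


def reducer_loads(join_keys, num_reducers):
    """Count rows per reducer when rows are partitioned by join key.

    Hypothetical helper: integer keys are assigned with a simple modulo,
    standing in for the hash partitioning of a shuffle-based join.
    """
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in join_keys:
        loads[key % num_reducers] += 1
    return loads


NUM_REDUCERS = 4

# 10,000 rows spread evenly over 100 distinct join keys.
uniform_keys = [i % 100 for i in range(10_000)]

# 10,000 rows where one hot key (7) accounts for 9,000 of them.
skewed_keys = [7] * 9_000 + [i % 100 for i in range(1_000)]

uniform = reducer_loads(uniform_keys, NUM_REDUCERS)
skewed = reducer_loads(skewed_keys, NUM_REDUCERS)

print("busiest reducer, uniform keys:", max(uniform.values()))
print("busiest reducer, skewed keys: ", max(skewed.values()))
```

A map join avoids this by broadcasting the small tables instead of
shuffling the large table by key.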

Note that the version I used is a bit old, my memory could be wrong, and
again I am not sure about the concrete background of HIVE-21189.

Thanks,
Okumin


On Thu, May 25, 2023 at 7:48 PM Sungwoo Park <glap...@gmail.com> wrote:

> Hello,
>
> In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to
> false. There is no record of why it was set to false, and I would like to
> understand the background for the decision. Specifically, I wonder if the
> following situation is relevant to the decision.
>
> Example)
> MapJoinOp_1 joins: table G, table A, table B, table C
> MapJoinOp_2 joins: table G, table A, table B, table D
>
> Here, table G is a big table to be read via shuffling.
> MapJoinOp_1 needs table C, while MapJoinOp_2 needs table D.
> SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and
> MapJoinOp_2 (because of table G and table A), so that both operators can
> share in-memory tables.
>
> Assume that MapJoinOp_1 is executed first and fills the cache. Then
> MapJoinOp_2 does not reload the cache because it is already filled, so
> table D is never loaded. As a result, it ends up with something like a
> NullPointerException.
>
> After setting hive.merge.nway.joins to true, I encountered a problem (which
> is not easy to reproduce), and I wonder if the above scenario is feasible
> in the current implementation.
>
> Many thanks,
>
> --- Sungwoo
>
> [1] https://issues.apache.org/jira/browse/HIVE-21189
>
