28490 and the performance of Hive 4.0.1 on MR3 1.12 (vs Trino 453)

Okumin Mon, 25 Nov 2024 06:18:30 -0800

Hi,

Thanks for submitting the patches and writing the post with fantastic
illustrations. The impressive documentation linking each ticket made it
easy for me to review.


I gave +1 to the last pull request. Unless anyone finds another issue, I
will merge it in 1 day.
https://github.com/apache/hive/pull/5424

Let's keep Hive's performance competitive.

Regards,
Okumin

On Thu, Oct 10, 2024 at 12:11 Sungwoo Park <[email protected]> wrote:

> Hello,
>
> For your query, pre-partitioning is performed on if() expressions. Since
> the output of if() expressions is skewed, the resultant query plan is
> inefficient.
>
> We think that the pre-partitioning columns should be restricted to those
> found in grouping sets (which I think is the case in Trino). We will test a
> new patch for HIVE-28489 implementing this idea.
>
> Thanks a lot for the feedback on HIVE-28489!
>
> --- Sungwoo
>
>
> On Wed, Oct 9, 2024 at 7:07 PM lisoda <[email protected]> wrote:
>
>> Hi.
>>
>> It's great to see this article. This indicates that since Hive 4.0,
>> downstream vendors of the Hive project have also started to become more
>> active. For the project's activity, this is a good thing. It shows that
>> more and more developers and users are starting to try version 4.0.
>>
>> Back to this blog, it mentions three optimization patches, in fact, we
>> noticed these three patches two weeks ago and have added them to our
>> production environment. In most cases, these three patches can indeed
>> greatly improve the efficiency of HIVEQL.
>> However, we also encountered some problems in the process of using these
>> three patches, and we raise these issues here in the hope of discussing
>> them with everyone.
>>
>> The main problem we encountered was with HIVE-28489. Although in most
>> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the
>> load produced by grouping sets, it currently cannot optimize this type of
>> scenario very well:
>>
>>
>> SELECT
>> count(distinct if(flag=0,uni_id,null)),
>> count(distinct if(flag=1,uni_id,null)),
>> count(distinct if(flag=2,uni_id,null))
>> from tbl
>> group by c1,c2,c3
>> grouping sets(
>> (c1,c2),
>> (c2,c3),
>> (c1,c3)
>> );
>>
>>
>> In this type of SQL, we need to make precise statistics on the uni_id
>> field based on the value of flag. After introducing HIVE-28489, it uses the
>> value of the expression if(flag=0,uni_id,null) as the grouping condition.
>> However, expressions like if(flag=0,uni_id,null) generate a large number of
>> null values, ultimately leading to skewed tasks. This makes the execution
>> of the Query very slow.
>>
>> At present, we do not have a very good way to handle this scenario. Does
>> anyone have experience dealing with this kind of problem? If someone can
>> guide me or join the discussion, I would be very grateful.
>>
>> Thanks,
>> --LiSoDa
>>
>>
>>
>>
>> 在 2024-10-09 16:21:20，"Sungwoo Park" <[email protected]> 写道：
>>
>> Hi everyone,
>>
>> We have published a blog article that reports the performance improvement
>> from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we
>> submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on
>> the 10TB TPC-DS benchmark, but the results could be useful to users of
>> Apache Hive 4 (with or without LLAP).
>>
>> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/
>>
>> We got the ideas for the three patches by comparing query plans generated
>> by Hive 4 and Trino (for those queries that Trino executes much faster than
>> Hive 4). Currently the patches are not actively reviewed, so we would
>> appreciate it if some committers could take a look and try merging them to
>> the master branch. If you are currently using Hive 4, backporting these
>> patches should improve the performance on some class of queries.
>>
>> Thanks,
>>
>> --- Sungwoo Park
>>
>>
>>
>>

Re: HIVE-28488/28489/28490 and the performance of Hive 4.0.1 on MR3 1.12 (vs Trino 453)

Reply via email to