Hello,

For your query, pre-partitioning is performed on if() expressions. Since
the output of if() expressions is skewed, the resultant query plan is
inefficient.

We think that the pre-partitioning columns should be restricted to those
found in grouping sets (which I think is the case in Trino). We will test a
new patch for HIVE-28489 implementing this idea.

Thanks a lot for the feedback on HIVE-28489!

--- Sungwoo


On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote:

> Hi.
>
> It's great to see this article. This indicates that since Hive 4.0,
> downstream vendors of the Hive project have also started to become more
> active. For the project's activity, this is a good thing. It shows that
> more and more developers and users are starting to try version 4.0.
>
> Back to this blog, it mentions three optimization patches, in fact, we
> noticed these three patches two weeks ago and have added them to our
> production environment. In most cases, these three patches can indeed
> greatly improve the efficiency of HIVEQL.
> However, we also encountered some problems in the process of using these
> three patches, and we raise these issues here in the hope of discussing
> them with everyone.
>
> The main problem we encountered was with HIVE-28489. Although in most
> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the
> load produced by grouping sets, it currently cannot optimize this type of
> scenario very well:
>
>
> SELECT
> count(distinct if(flag=0,uni_id,null)),
> count(distinct if(flag=1,uni_id,null)),
> count(distinct if(flag=2,uni_id,null))
> from tbl
> group by c1,c2,c3
> grouping sets(
> (c1,c2),
> (c2,c3),
> (c1,c3)
> );
>
>
> In this type of SQL, we need to make precise statistics on the uni_id
> field based on the value of flag. After introducing HIVE-28489, it uses the
> value of the expression if(flag=0,uni_id,null) as the grouping condition.
> However, expressions like if(flag=0,uni_id,null) generate a large number of
> null values, ultimately leading to skewed tasks. This makes the execution
> of the Query very slow.
>
> At present, we do not have a very good way to handle this scenario. Does
> anyone have experience dealing with this kind of problem? If someone can
> guide me or join the discussion, I would be very grateful.
>
> Thanks,
> --LiSoDa
>
>
>
>
> 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道:
>
> Hi everyone,
>
> We have published a blog article that reports the performance improvement
> from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we
> submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on
> the 10TB TPC-DS benchmark, but the results could be useful to users of
> Apache Hive 4 (with or without LLAP).
>
> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/
>
> We got the ideas for the three patches by comparing query plans generated
> by Hive 4 and Trino (for those queries that Trino executes much faster than
> Hive 4). Currently the patches are not actively reviewed, so we would
> appreciate it if some committers could take a look and try merging them to
> the master branch. If you are currently using Hive 4, backporting these
> patches should improve the performance on some class of queries.
>
> Thanks,
>
> --- Sungwoo Park
>
>
>
>

Reply via email to