Hello, For your query, pre-partitioning is performed on if() expressions. Since the output of if() expressions is skewed, the resultant query plan is inefficient.
We think that the pre-partitioning columns should be restricted to those found in grouping sets (which I think is the case in Trino). We will test a new patch for HIVE-28489 implementing this idea. Thanks a lot for the feedback on HIVE-28489! --- Sungwoo On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote: > Hi. > > It's great to see this article. This indicates that since Hive 4.0, > downstream vendors of the Hive project have also started to become more > active. For the project's activity, this is a good thing. It shows that > more and more developers and users are starting to try version 4.0. > > Back to this blog, it mentions three optimization patches, in fact, we > noticed these three patches two weeks ago and have added them to our > production environment. In most cases, these three patches can indeed > greatly improve the efficiency of HIVEQL. > However, we also encountered some problems in the process of using these > three patches, and we raise these issues here in the hope of discussing > them with everyone. > > The main problem we encountered was with HIVE-28489. Although in most > cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the > load produced by grouping sets, it currently cannot optimize this type of > scenario very well: > > > SELECT > count(distinct if(flag=0,uni_id,null)), > count(distinct if(flag=1,uni_id,null)), > count(distinct if(flag=2,uni_id,null)) > from tbl > group by c1,c2,c3 > grouping sets( > (c1,c2), > (c2,c3), > (c1,c3) > ); > > > In this type of SQL, we need to make precise statistics on the uni_id > field based on the value of flag. After introducing HIVE-28489, it uses the > value of the expression if(flag=0,uni_id,null) as the grouping condition. > However, expressions like if(flag=0,uni_id,null) generate a large number of > null values, ultimately leading to skewed tasks. This makes the execution > of the Query very slow. > > At present, we do not have a very good way to handle this scenario. Does > anyone have experience dealing with this kind of problem? If someone can > guide me or join the discussion, I would be very grateful. > > Thanks, > --LiSoDa > > > > > 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道: > > Hi everyone, > > We have published a blog article that reports the performance improvement > from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we > submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on > the 10TB TPC-DS benchmark, but the results could be useful to users of > Apache Hive 4 (with or without LLAP). > > https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/ > > We got the ideas for the three patches by comparing query plans generated > by Hive 4 and Trino (for those queries that Trino executes much faster than > Hive 4). Currently the patches are not actively reviewed, so we would > appreciate it if some committers could take a look and try merging them to > the master branch. If you are currently using Hive 4, backporting these > patches should improve the performance on some class of queries. > > Thanks, > > --- Sungwoo Park > > > >