Hi, Thanks for submitting the patches and writing the post with fantastic illustrations. The impressive documentation linking each ticket made it easy for me to review.
I gave +1 to the last pull request. Unless anyone finds another issue, I will merge it in 1 day. https://github.com/apache/hive/pull/5424 Let's keep Hive's performance competitive. Regards, Okumin On Thu, Oct 10, 2024 at 12:11 Sungwoo Park <glap...@gmail.com> wrote: > Hello, > > For your query, pre-partitioning is performed on if() expressions. Since > the output of if() expressions is skewed, the resultant query plan is > inefficient. > > We think that the pre-partitioning columns should be restricted to those > found in grouping sets (which I think is the case in Trino). We will test a > new patch for HIVE-28489 implementing this idea. > > Thanks a lot for the feedback on HIVE-28489! > > --- Sungwoo > > > On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote: > >> Hi. >> >> It's great to see this article. This indicates that since Hive 4.0, >> downstream vendors of the Hive project have also started to become more >> active. For the project's activity, this is a good thing. It shows that >> more and more developers and users are starting to try version 4.0. >> >> Back to this blog, it mentions three optimization patches, in fact, we >> noticed these three patches two weeks ago and have added them to our >> production environment. In most cases, these three patches can indeed >> greatly improve the efficiency of HIVEQL. >> However, we also encountered some problems in the process of using these >> three patches, and we raise these issues here in the hope of discussing >> them with everyone. >> >> The main problem we encountered was with HIVE-28489. Although in most >> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the >> load produced by grouping sets, it currently cannot optimize this type of >> scenario very well: >> >> >> SELECT >> count(distinct if(flag=0,uni_id,null)), >> count(distinct if(flag=1,uni_id,null)), >> count(distinct if(flag=2,uni_id,null)) >> from tbl >> group by c1,c2,c3 >> grouping sets( >> (c1,c2), >> (c2,c3), >> (c1,c3) >> ); >> >> >> In this type of SQL, we need to make precise statistics on the uni_id >> field based on the value of flag. After introducing HIVE-28489, it uses the >> value of the expression if(flag=0,uni_id,null) as the grouping condition. >> However, expressions like if(flag=0,uni_id,null) generate a large number of >> null values, ultimately leading to skewed tasks. This makes the execution >> of the Query very slow. >> >> At present, we do not have a very good way to handle this scenario. Does >> anyone have experience dealing with this kind of problem? If someone can >> guide me or join the discussion, I would be very grateful. >> >> Thanks, >> --LiSoDa >> >> >> >> >> 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道: >> >> Hi everyone, >> >> We have published a blog article that reports the performance improvement >> from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we >> submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on >> the 10TB TPC-DS benchmark, but the results could be useful to users of >> Apache Hive 4 (with or without LLAP). >> >> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/ >> >> We got the ideas for the three patches by comparing query plans generated >> by Hive 4 and Trino (for those queries that Trino executes much faster than >> Hive 4). Currently the patches are not actively reviewed, so we would >> appreciate it if some committers could take a look and try merging them to >> the master branch. If you are currently using Hive 4, backporting these >> patches should improve the performance on some class of queries. >> >> Thanks, >> >> --- Sungwoo Park >> >> >> >>