Hi.
It's great to see this article. This indicates that since Hive 4.0, downstream vendors of the Hive project have also started to become more active. For the project's activity, this is a good thing. It shows that more and more developers and users are starting to try version 4.0. Back to this blog, it mentions three optimization patches, in fact, we noticed these three patches two weeks ago and have added them to our production environment. In most cases, these three patches can indeed greatly improve the efficiency of HIVEQL. However, we also encountered some problems in the process of using these three patches, and we raise these issues here in the hope of discussing them with everyone. The main problem we encountered was with HIVE-28489. Although in most cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and the load produced by grouping sets, it currently cannot optimize this type of scenario very well: SELECT count(distinct if(flag=0,uni_id,null)), count(distinct if(flag=1,uni_id,null)), count(distinct if(flag=2,uni_id,null)) from tbl group by c1,c2,c3 grouping sets( (c1,c2), (c2,c3), (c1,c3) ); In this type of SQL, we need to make precise statistics on the uni_id field based on the value of flag. After introducing HIVE-28489, it uses the value of the expression if(flag=0,uni_id,null) as the grouping condition. However, expressions like if(flag=0,uni_id,null) generate a large number of null values, ultimately leading to skewed tasks. This makes the execution of the Query very slow. At present, we do not have a very good way to handle this scenario. Does anyone have experience dealing with this kind of problem? If someone can guide me or join the discussion, I would be very grateful. Thanks, --LiSoDa 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道: Hi everyone, We have published a blog article that reports the performance improvement from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on the 10TB TPC-DS benchmark, but the results could be useful to users of Apache Hive 4 (with or without LLAP). https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/ We got the ideas for the three patches by comparing query plans generated by Hive 4 and Trino (for those queries that Trino executes much faster than Hive 4). Currently the patches are not actively reviewed, so we would appreciate it if some committers could take a look and try merging them to the master branch. If you are currently using Hive 4, backporting these patches should improve the performance on some class of queries. Thanks, --- Sungwoo Park