Hi.

It's great to see this article. This indicates that since Hive 4.0, downstream 
vendors of the Hive project have also started to become more active. For the 
project's activity, this is a good thing. It shows that more and more 
developers and users are starting to try version 4.0.


Back to this blog, it mentions three optimization patches, in fact, we noticed 
these three patches two weeks ago and have added them to our production 
environment. In most cases, these three patches can indeed greatly improve the 
efficiency of HIVEQL.
However, we also encountered some problems in the process of using these three 
patches, and we raise these issues here in the hope of discussing them with 
everyone.


The main problem we encountered was with HIVE-28489. Although in most cases, 
HIVE-28489 can effectively reduce the amount of SHUFFLE data and the load 
produced by grouping sets, it currently cannot optimize this type of scenario 
very well:




SELECT
count(distinct if(flag=0,uni_id,null)),
count(distinct if(flag=1,uni_id,null)),
count(distinct if(flag=2,uni_id,null))
from tbl
group by c1,c2,c3
grouping sets(
(c1,c2),
(c2,c3),
(c1,c3)
);




In this type of SQL, we need to make precise statistics on the uni_id field 
based on the value of flag. After introducing HIVE-28489, it uses the value of 
the expression if(flag=0,uni_id,null) as the grouping condition. However, 
expressions like if(flag=0,uni_id,null) generate a large number of null values, 
ultimately leading to skewed tasks. This makes the execution of the Query very 
slow.


At present, we do not have a very good way to handle this scenario. Does anyone 
have experience dealing with this kind of problem? If someone can guide me or 
join the discussion, I would be very grateful.


Thanks,
--LiSoDa









在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道:

Hi everyone,


We have published a blog article that reports the performance improvement from 
three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we submitted some 
time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on the 10TB TPC-DS 
benchmark, but the results could be useful to users of Apache Hive 4 (with or 
without LLAP).


https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/



We got the ideas for the three patches by comparing query plans generated by 
Hive 4 and Trino (for those queries that Trino executes much faster than Hive 
4). Currently the patches are not actively reviewed, so we would appreciate it 
if some committers could take a look and try merging them to the master branch. 
If you are currently using Hive 4, backporting these patches should improve the 
performance on some class of queries.


Thanks,


--- Sungwoo Park





Reply via email to