Hi, We've merged all three pull requests. Thanks for your contributions.
> 1. The query plan is identical, but Trino is much faster. This is due to the > architectural difference between Trino and Hive (on shuffle-intensive > queries): Trino is based on MPP and thus uses the push model, while Hive uses > the pull model. There is not much we can do about this type of queries. (Note > that the push model has its own drawbacks and thus does not always win over > the pull model. That's why Trino is much slower than Hive on many queries.) If we'd like to accelerate those queries, we may be able to enhance `tez.runtime.pipelined-shuffle.enabled`. I've not used this feature, and IMO the priority is lower considering the use case of Apache Hive. > 2. Trino generates a query plan that is clearly more efficient than Hive. We > made some attempt to find a solution in Hive, but came to a preliminary > conclusion that this would require a significant change in the query compiler > (e.g., if the decision made later during query compilation is inconsistent > with an earlier assumption, retry with a different assumption until > consistency is reached). Interesting. I would like to know more details about those points. We can help with that part, as I am also involved with Trino. Regards, Okumin On Tue, Nov 26, 2024 at 1:57 AM Sungwoo Park <glap...@gmail.com> wrote: > > Hello, > > Thanks a lot for reviewing HIVE-28489 in detail. So, finally it looks like > all the three patches will be merged to Hive! > > While analyzing query plans generated by Hive and Trino, we identified a few > more categories of queries on which Trino was much faster than Hive, such as: > > 1. The query plan is identical, but Trino is much faster. This is due to the > architectural difference between Trino and Hive (on shuffle-intensive > queries): Trino is based on MPP and thus uses the push model, while Hive uses > the pull model. There is not much we can do about this type of queries. (Note > that the push model has its own drawbacks and thus does not always win over > the pull model. That's why Trino is much slower than Hive on many queries.) > > 2. Trino generates a query plan that is clearly more efficient than Hive. We > made some attempt to find a solution in Hive, but came to a preliminary > conclusion that this would require a significant change in the query compiler > (e.g., if the decision made later during query compilation is inconsistent > with an earlier assumption, retry with a different assumption until > consistency is reached). > > Currently we don't fully understand the internals of Trino compilation, and > don't know how Trino generates efficient query plans. If we figure it out, we > can try to replicate the process in Hive. If not, we can open a discussion > where we can discuss potential solutions. > > Thanks, > > --- Sungwoo > > > On Mon, Nov 25, 2024 at 11:18 PM Okumin <m...@okumin.com> wrote: >> >> Hi, >> >> Thanks for submitting the patches and writing the post with fantastic >> illustrations. The impressive documentation linking each ticket made it easy >> for me to review. >> >> I gave +1 to the last pull request. Unless anyone finds another issue, I >> will merge it in 1 day. >> https://github.com/apache/hive/pull/5424 >> >> Let's keep Hive's performance competitive. >> >> Regards, >> Okumin >> >> On Thu, Oct 10, 2024 at 12:11 Sungwoo Park <glap...@gmail.com> wrote: >>> >>> Hello, >>> >>> For your query, pre-partitioning is performed on if() expressions. Since >>> the output of if() expressions is skewed, the resultant query plan is >>> inefficient. >>> >>> We think that the pre-partitioning columns should be restricted to those >>> found in grouping sets (which I think is the case in Trino). We will test a >>> new patch for HIVE-28489 implementing this idea. >>> >>> Thanks a lot for the feedback on HIVE-28489! >>> >>> --- Sungwoo >>> >>> >>> On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote: >>>> >>>> Hi. >>>> >>>> It's great to see this article. This indicates that since Hive 4.0, >>>> downstream vendors of the Hive project have also started to become more >>>> active. For the project's activity, this is a good thing. It shows that >>>> more and more developers and users are starting to try version 4.0. >>>> >>>> Back to this blog, it mentions three optimization patches, in fact, we >>>> noticed these three patches two weeks ago and have added them to our >>>> production environment. In most cases, these three patches can indeed >>>> greatly improve the efficiency of HIVEQL. >>>> However, we also encountered some problems in the process of using these >>>> three patches, and we raise these issues here in the hope of discussing >>>> them with everyone. >>>> >>>> The main problem we encountered was with HIVE-28489. Although in most >>>> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and >>>> the load produced by grouping sets, it currently cannot optimize this type >>>> of scenario very well: >>>> >>>> >>>> SELECT >>>> count(distinct if(flag=0,uni_id,null)), >>>> count(distinct if(flag=1,uni_id,null)), >>>> count(distinct if(flag=2,uni_id,null)) >>>> from tbl >>>> group by c1,c2,c3 >>>> grouping sets( >>>> (c1,c2), >>>> (c2,c3), >>>> (c1,c3) >>>> ); >>>> >>>> >>>> In this type of SQL, we need to make precise statistics on the uni_id >>>> field based on the value of flag. After introducing HIVE-28489, it uses >>>> the value of the expression if(flag=0,uni_id,null) as the grouping >>>> condition. However, expressions like if(flag=0,uni_id,null) generate a >>>> large number of null values, ultimately leading to skewed tasks. This >>>> makes the execution of the Query very slow. >>>> >>>> At present, we do not have a very good way to handle this scenario. Does >>>> anyone have experience dealing with this kind of problem? If someone can >>>> guide me or join the discussion, I would be very grateful. >>>> >>>> Thanks, >>>> --LiSoDa >>>> >>>> >>>> >>>> >>>> 在 2024-10-09 16:21:20,"Sungwoo Park" <glap...@gmail.com> 写道: >>>> >>>> Hi everyone, >>>> >>>> We have published a blog article that reports the performance improvement >>>> from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we >>>> submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on >>>> the 10TB TPC-DS benchmark, but the results could be useful to users of >>>> Apache Hive 4 (with or without LLAP). >>>> >>>> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/ >>>> >>>> We got the ideas for the three patches by comparing query plans generated >>>> by Hive 4 and Trino (for those queries that Trino executes much faster >>>> than Hive 4). Currently the patches are not actively reviewed, so we would >>>> appreciate it if some committers could take a look and try merging them to >>>> the master branch. If you are currently using Hive 4, backporting these >>>> patches should improve the performance on some class of queries. >>>> >>>> Thanks, >>>> >>>> --- Sungwoo Park >>>> >>>> >>>>