28490 and the performance of Hive 4.0.1 on MR3 1.12 (vs Trino 453)

Shohei Okumiya Thu, 28 Nov 2024 07:36:46 -0800

Hi,

We've merged all three pull requests. Thanks for your contributions.


> 1. The query plan is identical, but Trino is much faster. This is due to the 
> architectural difference between Trino and Hive (on shuffle-intensive 
> queries): Trino is based on MPP and thus uses the push model, while Hive uses 
> the pull model. There is not much we can do about this type of queries. (Note 
> that the push model has its own drawbacks and thus does not always win over 
> the pull model. That's why Trino is much slower than Hive on many queries.)

If we'd like to accelerate those queries, we may be able to enhance
`tez.runtime.pipelined-shuffle.enabled`. I've not used this feature,
and IMO the priority is lower considering the use case of Apache Hive.

> 2. Trino generates a query plan that is clearly more efficient than Hive. We 
> made some attempt to find a solution in Hive, but came to a preliminary 
> conclusion that this would require a significant change in the query compiler 
> (e.g., if the decision made later during query compilation is inconsistent 
> with an earlier assumption, retry with a different assumption until 
> consistency is reached).

Interesting. I would like to know more details about those points. We
can help with that part, as I am also involved with Trino.

Regards,
Okumin

On Tue, Nov 26, 2024 at 1:57 AM Sungwoo Park <glap...@gmail.com> wrote:
>
> Hello,
>
> Thanks a lot for reviewing HIVE-28489 in detail. So, finally it looks like 
> all the three patches will be merged to Hive!
>
> While analyzing query plans generated by Hive and Trino, we identified a few 
> more categories of queries on which Trino was much faster than Hive, such as:
>
> 1. The query plan is identical, but Trino is much faster. This is due to the 
> architectural difference between Trino and Hive (on shuffle-intensive 
> queries): Trino is based on MPP and thus uses the push model, while Hive uses 
> the pull model. There is not much we can do about this type of queries. (Note 
> that the push model has its own drawbacks and thus does not always win over 
> the pull model. That's why Trino is much slower than Hive on many queries.)
>
> 2. Trino generates a query plan that is clearly more efficient than Hive. We 
> made some attempt to find a solution in Hive, but came to a preliminary 
> conclusion that this would require a significant change in the query compiler 
> (e.g., if the decision made later during query compilation is inconsistent 
> with an earlier assumption, retry with a different assumption until 
> consistency is reached).
>
> Currently we don't fully understand the internals of Trino compilation, and 
> don't know how Trino generates efficient query plans. If we figure it out, we 
> can try to replicate the process in Hive. If not, we can open a discussion 
> where we can discuss potential solutions.
>
> Thanks,
>
> --- Sungwoo
>
>
> On Mon, Nov 25, 2024 at 11:18 PM Okumin <m...@okumin.com> wrote:
>>
>> Hi,
>>
>> Thanks for submitting the patches and writing the post with fantastic 
>> illustrations. The impressive documentation linking each ticket made it easy 
>> for me to review.
>>
>> I gave +1 to the last pull request. Unless anyone finds another issue, I 
>> will merge it in 1 day.
>> https://github.com/apache/hive/pull/5424
>>
>> Let's keep Hive's performance competitive.
>>
>> Regards,
>> Okumin
>>
>> On Thu, Oct 10, 2024 at 12:11 Sungwoo Park <glap...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> For your query, pre-partitioning is performed on if() expressions. Since 
>>> the output of if() expressions is skewed, the resultant query plan is 
>>> inefficient.
>>>
>>> We think that the pre-partitioning columns should be restricted to those 
>>> found in grouping sets (which I think is the case in Trino). We will test a 
>>> new patch for HIVE-28489 implementing this idea.
>>>
>>> Thanks a lot for the feedback on HIVE-28489!
>>>
>>> --- Sungwoo
>>>
>>>
>>> On Wed, Oct 9, 2024 at 7:07 PM lisoda <lis...@yeah.net> wrote:
>>>>
>>>> Hi.
>>>>
>>>> It's great to see this article. This indicates that since Hive 4.0, 
>>>> downstream vendors of the Hive project have also started to become more 
>>>> active. For the project's activity, this is a good thing. It shows that 
>>>> more and more developers and users are starting to try version 4.0.
>>>>
>>>> Back to this blog, it mentions three optimization patches, in fact, we 
>>>> noticed these three patches two weeks ago and have added them to our 
>>>> production environment. In most cases, these three patches can indeed 
>>>> greatly improve the efficiency of HIVEQL.
>>>> However, we also encountered some problems in the process of using these 
>>>> three patches, and we raise these issues here in the hope of discussing 
>>>> them with everyone.
>>>>
>>>> The main problem we encountered was with HIVE-28489. Although in most 
>>>> cases, HIVE-28489 can effectively reduce the amount of SHUFFLE data and 
>>>> the load produced by grouping sets, it currently cannot optimize this type 
>>>> of scenario very well:
>>>>
>>>>
>>>> SELECT
>>>> count(distinct if(flag=0,uni_id,null)),
>>>> count(distinct if(flag=1,uni_id,null)),
>>>> count(distinct if(flag=2,uni_id,null))
>>>> from tbl
>>>> group by c1,c2,c3
>>>> grouping sets(
>>>> (c1,c2),
>>>> (c2,c3),
>>>> (c1,c3)
>>>> );
>>>>
>>>>
>>>> In this type of SQL, we need to make precise statistics on the uni_id 
>>>> field based on the value of flag. After introducing HIVE-28489, it uses 
>>>> the value of the expression if(flag=0,uni_id,null) as the grouping 
>>>> condition. However, expressions like if(flag=0,uni_id,null) generate a 
>>>> large number of null values, ultimately leading to skewed tasks. This 
>>>> makes the execution of the Query very slow.
>>>>
>>>> At present, we do not have a very good way to handle this scenario. Does 
>>>> anyone have experience dealing with this kind of problem? If someone can 
>>>> guide me or join the discussion, I would be very grateful.
>>>>
>>>> Thanks,
>>>> --LiSoDa
>>>>
>>>>
>>>>
>>>>
>>>> 在 2024-10-09 16:21:20，"Sungwoo Park" <glap...@gmail.com> 写道：
>>>>
>>>> Hi everyone,
>>>>
>>>> We have published a blog article that reports the performance improvement 
>>>> from three patches HIVE-28488, HIVE-28489, and HIVE-28490 which we 
>>>> submitted some time ago. It evaluates Hive 4.0.1 on MR3 and Trino 453 on 
>>>> the 10TB TPC-DS benchmark, but the results could be useful to users of 
>>>> Apache Hive 4 (with or without LLAP).
>>>>
>>>> https://www.datamonad.com/post/2024-10-09-optimizing-hive-4.0-performance/
>>>>
>>>> We got the ideas for the three patches by comparing query plans generated 
>>>> by Hive 4 and Trino (for those queries that Trino executes much faster 
>>>> than Hive 4). Currently the patches are not actively reviewed, so we would 
>>>> appreciate it if some committers could take a look and try merging them to 
>>>> the master branch. If you are currently using Hive 4, backporting these 
>>>> patches should improve the performance on some class of queries.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo Park
>>>>
>>>>
>>>>

Re: HIVE-28488/28489/28490 and the performance of Hive 4.0.1 on MR3 1.12 (vs Trino 453)

Reply via email to