[ 
https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765068#comment-15765068
 ] 

Xuefu Zhang commented on HIVE-15474:
------------------------------------

Hi [~jcamachorodriguez], thanks for the explanation. 

Re: GBy will not produce duplicates for those columns, while Hive 
implementation based on RS ensures that GBy output actually follows a certain 
order.

This assumption isn't always true, actually. MR-style shuffle has that 
property (maybe Tez too), but it's not true for Hive on Spark, where groups 
are not necessarily ordered.
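
To make the concern concrete, here is a toy sketch in Python. It is not Hive's actual operator code: the function names and the "keep the first n groups" model of a pushed-down limit are assumptions made purely for illustration. It shows that truncating GBy output early matches the sort-then-limit plan only when groups arrive in sort order.

```python
from itertools import islice


def group_count(rows):
    """Count rows per (key, value); the dict keeps groups in first-seen order."""
    counts = {}
    for key, value in rows:
        counts[(key, value)] = counts.get((key, value), 0) + 1
    return counts


def limit_after_sort(groups, n):
    """Reference plan: sort every group, then apply the limit."""
    return sorted((k, v, c) for (k, v), c in groups.items())[:n]


def limit_pushed_into_gby(groups, n):
    """Hypothetical naive pushdown: keep the first n groups emitted, then sort."""
    first_n = islice(groups.items(), n)
    return sorted((k, v, c) for (k, v), c in first_n)


rows = [("c", "3"), ("b", "2"), ("a", "1"), ("a", "1")]

# Groups arrive in key order (the MR-style shuffle assumption): plans agree.
ordered = group_count(sorted(rows))
same_when_ordered = limit_pushed_into_gby(ordered, 2) == limit_after_sort(ordered, 2)

# Groups arrive in arbitrary order (possible on Spark): plans diverge.
unordered = group_count(rows)
same_when_unordered = limit_pushed_into_gby(unordered, 2) == limit_after_sort(unordered, 2)

print(same_when_ordered, same_when_unordered)  # True False
```

Whether the real optimization is affected depends on how the limit is enforced in the plan, which is why the explain output matters.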

Nevertheless, I'm still not 100% sure whether there is any impact. In fact, looking at 
test cases in the patch, I'm not sure of the plan difference. Thus, it would be 
very helpful if you could provide the explain output for the example query both 
with and without the optimization.

Thanks.

> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
>                 Key: HIVE-15474
>                 URL: https://issues.apache.org/jira/browse/HIVE-15474
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src 
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We can push the limit to the GBy operator. However, currently we do not do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
