[ 
https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766929#comment-15766929
 ] 

Rui Li commented on HIVE-15474:
-------------------------------

Hi [~jcamachorodriguez], I think it's an interesting optimization and please 
help me understand it better. For groupBy + orderBy query, we'll roughly have 
the following operator chain (assuming map side aggregation is on - GBY1 and 
RS2):
{{GBY1 -- RS2 -- GBY3 -- FS4 -- TS5 -- RS6 -- SEL7 -- FS8}}
Is the proposal to push the limit into GBY3, so that we'll have less data for 
the sorting stage? If so, I think that requires the output of GBY3 already 
being sorted, right? For MR (and probably Tez too) the shuffled data is sorted 
by key within each partition, which means the input to GBY3 is sorted. So 
you're implying that given a sorted input, GBY operator will maintain that 
order in output?
If my understanding is correct so far, this will have some problem with Hive on 
Spark. Because for Spark, the groupBy shuffling doesn't guarantee an order - 
the input to GBY3 may not be sorted. This makes sense in that groupBy semantic 
doesn't require a specific ordering. But maybe we can make some change to adapt 
to this optimization. Actually I'm also wondering: if we use parallel order by 
(to use a range partitioner rather than a hash partitioner in RS2), we can do 
the groupBy and orderBy in a single stage, which may improve performance in 
some cases.

> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
>                 Key: HIVE-15474
>                 URL: https://issues.apache.org/jira/browse/HIVE-15474
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src 
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We can push the limit to the GBy operator. However, currently we do not do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to