[ https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766929#comment-15766929 ]
Rui Li commented on HIVE-15474: ------------------------------- Hi [~jcamachorodriguez], I think it's an interesting optimization and please help me understand it better. For groupBy + orderBy query, we'll roughly have the following operator chain (assuming map side aggregation is on - GBY1 and RS2): {{GBY1 -- RS2 -- GBY3 -- FS4 -- TS5 -- RS6 -- SEL7 -- FS8}} Is the proposal to push the limit into GBY3, so that we'll have less data for the sorting stage? If so, I think that requires the output of GBY3 already being sorted, right? For MR (and probably Tez too) the shuffled data is sorted by key within each partition, which means the input to GBY3 is sorted. So you're implying that given a sorted input, GBY operator will maintain that order in output? If my understanding is correct so far, this will have some problem with Hive on Spark. Because for Spark, the groupBy shuffling doesn't guarantee an order - the input to GBY3 may not be sorted. This makes sense in that groupBy semantic doesn't require a specific ordering. But maybe we can make some change to adapt to this optimization. Actually I'm also wondering: if we use parallel order by (to use a range partitioner rather than a hash partitioner in RS2), we can do the groupBy and orderBy in a single stage, which may improve performance in some cases. > Extend limit propagation for chain of RS-GB-RS operators > -------------------------------------------------------- > > Key: HIVE-15474 > URL: https://issues.apache.org/jira/browse/HIVE-15474 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 2.2.0 > Reporter: Jesus Camacho Rodriguez > Assignee: Jesus Camacho Rodriguez > Attachments: HIVE-15474.patch > > > The goal is to extend the work started in HIVE-14002. > For instance, given the following query: > {code:sql} > explain > select key, value, count(key + 1) as agg1 from src > group by key, value > order by key, value, agg1 limit 20; > {code} > We can push the limit to the GBy operator. However, currently we do not do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)