[ https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765068#comment-15765068 ]
Xuefu Zhang commented on HIVE-15474:
------------------------------------

Hi [~jcamachorodriguez], thanks for the explanation.

Re: GBy will not produce duplicates for those columns, while Hive implementation based on RS ensures that GBy output actually follows a certain order.

This assumption isn't always true. MR-style shuffle has that property (maybe Tez does too), but it does not hold for Hive on Spark, where groups are not necessarily ordered. Nevertheless, I'm still not 100% sure whether there is any impact. In fact, looking at the test cases in the patch, I'm sure of the plan difference. Thus, it would be very helpful if you could provide the explain output for the example query both with and without the optimization.

Thanks.

> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
>                 Key: HIVE-15474
>                 URL: https://issues.apache.org/jira/browse/HIVE-15474
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We can push the limit to the GBy operator. However, currently we do not do it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
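
The ordering concern raised in the comment can be illustrated with a small sketch. This is plain Python, not Hive code, and every name in it (`group_counts`, `limit_first_groups`, `limit_top_groups`, `reference`) is hypothetical. It assumes that `count(key + 1)` can be approximated by a row count per group, and that the first-seen order of a `Counter` stands in for a shuffle that does not sort groups. Under those assumptions, keeping the first N groups emitted by the group-by only matches the unoptimized result when the groups arrive sorted by the grouping columns, while keeping the N smallest groups does not depend on arrival order.

```python
import heapq
from collections import Counter


def group_counts(rows):
    """Group (key, value) rows and count them, emitting groups in first-seen order.

    First-seen order is used here as a stand-in for an unsorted shuffle.
    """
    counts = Counter(rows)
    return [(k, v, c) for (k, v), c in counts.items()]


def reference(rows, n):
    """No pushdown: group, then order by (key, value, agg1), then limit n."""
    return sorted(group_counts(rows))[:n]


def limit_first_groups(grouped, n):
    """Naive pushdown: keep the first n groups as emitted, then sort them.

    Only correct if `grouped` is already ordered by (key, value).
    """
    return sorted(grouped[:n])


def limit_top_groups(grouped, n):
    """Order-independent pushdown: keep the n smallest groups by (key, value, agg1)."""
    return heapq.nsmallest(n, grouped)


rows = [("c", "1"), ("a", "1"), ("b", "1"), ("a", "1")]
unsorted_groups = group_counts(rows)      # arrival order: c, a, b
sorted_groups = sorted(unsorted_groups)   # arrival order: a, b, c

expected = reference(rows, 2)
print(limit_first_groups(sorted_groups, 2) == expected)    # sorted arrival
print(limit_first_groups(unsorted_groups, 2) == expected)  # unsorted arrival
print(limit_top_groups(unsorted_groups, 2) == expected)    # top-N, any arrival
```

Because the grouping columns are unique per group after the GBy, ordering by `key, value, agg1` is decided by `key, value` alone, which is why a per-group top-N is enough in this sketch. Whether the patch relies on sorted arrival or on a top-N at the GBy is exactly what the requested explain output would show.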