[ https://issues.apache.org/jira/browse/HIVE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778947#comment-13778947 ]
Yin Huai commented on HIVE-5358:
--------------------------------

[~ashutoshc] I think that for those two cases with hive.optimize.correlation=true, the ordering of the key columns does not matter. Those queries only need to group rows, so either [key, value] or [value, key] is fine for the RS. The reason I preserved the ordering in Correlation Optimizer is that ReduceSinkDeDuplication can merge the RS for an ORDER BY with another RS (for example, one for a GROUP BY). In that case, ordering matters. When Correlation Optimizer gets the operator tree, it does not know whether the key columns in an RS are used only for grouping or also for ordering. I think it may be better to annotate which columns are used for grouping and which are used for sorting.

[~chenchun] With your change, what will the plan be for the following query?
{code}
select c3, c2 from (select c1, c2, c3 from t1 order by c1, c2, c3) t group by c3, c2;
{code}
If we use [c1, c2, c3] as the key columns, rows with the same [c3, c2] are not grouped together on the reduce side. Based on my understanding, checkExprs in ReduceSinkDeDuplication currently only handles cases where ckeys start with pkeys or pkeys start with ckeys. For example, pkeys = [c1, c2, c3] and ckeys = [c1, c2].
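
To make the prefix condition above concrete, here is a minimal Java sketch. It is not Hive's actual checkExprs: the class and method names are hypothetical, and key columns are modeled as plain strings rather than expression nodes. It contrasts an order-sensitive prefix test with a hypothetical order-insensitive containment test, which would accept the ORDER BY query above even though rows with the same [c3, c2] would not be adjacent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; not the Hive implementation.
class KeyPrefixCheck {

    // Order-sensitive: true when one key list is a prefix of the other.
    static boolean isPrefixCompatible(List<String> pkeys, List<String> ckeys) {
        int n = Math.min(pkeys.size(), ckeys.size());
        return pkeys.subList(0, n).equals(ckeys.subList(0, n));
    }

    // Order-insensitive (assumed variant): true when every child key
    // appears somewhere among the parent keys.
    static boolean containsAllIgnoringOrder(List<String> pkeys, List<String> ckeys) {
        return new HashSet<>(pkeys).containsAll(ckeys);
    }

    public static void main(String[] args) {
        List<String> orderByKeys = Arrays.asList("c1", "c2", "c3");

        // pkeys = [c1, c2, c3], ckeys = [c1, c2]: prefix match.
        System.out.println(isPrefixCompatible(orderByKeys, Arrays.asList("c1", "c2"))); // true

        // pkeys = [c1, c2, c3], ckeys = [c3, c2]: no prefix match.
        System.out.println(isPrefixCompatible(orderByKeys, Arrays.asList("c3", "c2"))); // false

        // The order-insensitive test accepts the same pair, which is the risky case.
        System.out.println(containsAllIgnoringOrder(orderByKeys, Arrays.asList("c3", "c2"))); // true

        // [key, value] vs. [value, key]: grouping only, so reordering is harmless here.
        System.out.println(containsAllIgnoringOrder(
                Arrays.asList("key", "value"), Arrays.asList("value", "key"))); // true
    }
}
```

The sketch suggests why the check cannot ignore order unconditionally: without knowing whether the parent keys are needed for sorting, the two `true` results from the order-insensitive test are indistinguishable.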
> ReduceSinkDeDuplication should ignore column orders when check overlapping part of keys between parent and child
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5358
>                 URL: https://issues.apache.org/jira/browse/HIVE-5358
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>         Attachments: D13113.1.patch, HIVE-5358.2.patch, HIVE-5358.patch
>
>
> {code}
> select key, value from (select key, value from src group by key, value) t group by key, value;
> {code}
> This can be optimized by ReduceSinkDeDuplication
> {code}
> select key, value from (select key, value from src group by key, value) t group by value, key;
> {code}
> However the sql above can't be optimized by ReduceSinkDeDuplication currently due to different column orders of parent and child operator.