[ https://issues.apache.org/jira/browse/HIVE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780015#comment-13780015 ]
Yin Huai commented on HIVE-5358:
--------------------------------

My last example was not good. Let me try another one. The query may not make much sense, but I hope it makes the problem clear.

{code}
select c3, c2 from (select c1, c2, c3, c4 from t2 group by c1, c2, c3, c4) t group by c3, c2;
{code}

The first GBY groups rows on [c1, c2, c3, c4], and the second GBY then groups the output of the first GBY on [c3, c2]. We can use [c2, c3] as the partitioning columns to make sure rows are distributed correctly. Then, if we use [c3, c2] as the sorting columns (key columns in RS), c1 and c4 will be in the value columns of RS, so it seems we also need to adjust the first GBY to construct its key from both the key and the value of the reduce input. If we instead use [c1, c2, c3, c4] as the sorting columns, it seems we need to introduce a sort operator to generate row groups based on [c3, c2].

I am also attaching the plan generated by your .2 patch:

{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t:t2
          TableScan
            alias: t2
            Select Operator
              expressions:
                    expr: c1
                    type: int
                    expr: c2
                    type: int
                    expr: c3
                    type: int
                    expr: c4
                    type: int
              outputColumnNames: c1, c2, c3, c4
              Group By Operator
                bucketGroup: false
                keys:
                      expr: c1
                      type: int
                      expr: c2
                      type: int
                      expr: c3
                      type: int
                      expr: c4
                      type: int
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: int
                        expr: _col2
                        type: int
                        expr: _col3
                        type: int
                  sort order: ++++
                  Map-reduce partition columns:
                        expr: _col2
                        type: int
                        expr: _col1
                        type: int
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
                expr: KEY._col1
                type: int
                expr: KEY._col2
                type: int
                expr: KEY._col3
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col2
                  type: int
                  expr: _col1
                  type: int
            outputColumnNames: _col2, _col1
            Group By Operator
              bucketGroup: false
              keys:
                    expr: _col2
                    type: int
                    expr: _col1
                    type: int
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

> ReduceSinkDeDuplication should ignore column orders when check overlapping
> part of keys between parent and child
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5358
>                 URL: https://issues.apache.org/jira/browse/HIVE-5358
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>         Attachments: D13113.1.patch, HIVE-5358.2.patch, HIVE-5358.patch
>
>
> {code}
> select key, value from (select key, value from src group by key, value) t
> group by key, value;
> {code}
> This can be optimized by ReduceSinkDeDuplication
> {code}
> select key, value from (select key, value from src group by key, value) t
> group by value, key;
> {code}
> However the sql above can't be optimized by ReduceSinkDeDuplication currently
> due to different column orders of parent and child operator.
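
To illustrate the sort-order concern raised in the comment above, here is a minimal Python sketch. It is not Hive code: `streaming_group_by`, `reducer_input`, and the sample rows are hypothetical, and the sketch assumes a reduce-side GBY that only compares each row with the previous one, so it relies on equal keys being adjacent in its input. With rows sorted on [c1, c2, c3, c4], the second GBY on [c3, c2] sees the same key in two separate runs unless its input is re-sorted first.

```python
from itertools import groupby

def streaming_group_by(rows, key):
    """Emit one key per run of consecutive rows with an equal key.

    Simplified stand-in for a group-by that expects sorted input:
    it never buffers, so non-adjacent equal keys become separate groups.
    """
    return [k for k, _ in groupby(rows, key=key)]

def full_key(row):
    # Grouping key of the inner GBY: [c1, c2, c3, c4].
    return row

def outer_key(row):
    # Grouping key of the outer GBY: [c3, c2].
    c1, c2, c3, c4 = row
    return (c3, c2)

# Hypothetical rows (c1, c2, c3, c4) that reached a single reducer.
reducer_input = [(3, 1, 1, 0), (1, 1, 1, 0), (2, 1, 2, 0), (1, 1, 1, 0)]

# Sort on [c1, c2, c3, c4], as in the attached plan (sort order ++++).
sorted_on_full_key = sorted(reducer_input, key=full_key)

# Inner GBY works: equal full keys are adjacent.
inner_output = streaming_group_by(sorted_on_full_key, full_key)

# Outer GBY applied directly to that stream: key (1, 1) shows up twice.
outer_without_sort = streaming_group_by(inner_output, outer_key)

# Outer GBY after an explicit re-sort on [c3, c2]: each key appears once.
outer_with_sort = streaming_group_by(sorted(inner_output, key=outer_key), outer_key)

print(outer_without_sort)
print(outer_with_sort)
```

The re-sort step plays the role of the extra sort operator mentioned in the comment; whether Hive should add one or instead change the RS key columns is the open question in this thread.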