[ https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779971#comment-13779971 ]
Yin Huai commented on HIVE-5357:
--------------------------------

I updated the description so that people can see which bug we are addressing.

> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5357
>                 URL: https://issues.apache.org/jira/browse/HIVE-5357
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.11.0
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: HIVE-5357.patch
>
>
> Example:
> {code}
> select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
> //result
> 0 0 NULL
> 10 10 NULL
> 100 100 NULL
> 103 103 NULL
> 104 104 NULL
> {code}
> The result is clearly wrong: the query selects two columns, but every output
> row has three fields, the last of which is NULL.
> When we have a simple group by query with a distinct column
> {code}
> explain select count(distinct value) from src group by key;
> {code}
> The plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         src
>           TableScan
>             alias: src
>             Select Operator
>               expressions:
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>               outputColumnNames: key, value
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col1:0._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> Note that the map-side GBY adds the distinct columns (value in this case) to
> its key columns, while the reduce-side GBY groups by key only.
> When RSDedup optimizes a query involving a GBY with distinct keys and
> map-side aggregation is enabled, it currently assigns the map-side GBY's key
> columns to the reduce-side GBY. So, for the example shown at the beginning,
> once the plan has been collapsed into a single MR job, the second reduce-side
> GBY uses both key and value as its key columns. The correct key column is
> key alone.
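
To make the key mix-up described above concrete, here is a small Python sketch. It is a toy model, not Hive code: the helper `group_count_distinct` and the sample rows (including the made-up value `val_10b`) are assumptions for illustration only. It does not reproduce the NULL values in the reported output; it only shows how grouping by the map-side keys (key, value) instead of key changes the shape of the result.

```python
from collections import defaultdict


def group_count_distinct(rows, key_cols, distinct_col):
    """Group rows (dicts) by key_cols and count distinct values of distinct_col.

    Hypothetical helper for illustration; it is not part of Hive.
    """
    groups = defaultdict(set)
    for row in rows:
        group_key = tuple(row[c] for c in key_cols)
        groups[group_key].add(row[distinct_col])
    return {k: len(v) for k, v in sorted(groups.items())}


# Made-up sample data standing in for the output of the inner
# "select key, value from src group by key, value".
rows = [
    {"key": "0", "value": "val_0"},
    {"key": "10", "value": "val_10"},
    {"key": "10", "value": "val_10b"},
]

# Intended reduce-side grouping: by key only.
correct = group_count_distinct(rows, ["key"], "value")

# Grouping with the map-side GBY's keys, which also contain the distinct column.
wrong = group_count_distinct(rows, ["key", "value"], "value")

print(correct)
print(wrong)
```

With the intended keys there is one group per key. With the map-side keys there is one group per (key, value) pair, so the output gains an extra column and the distinct count no longer aggregates across values of the same key.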