[ https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780791#comment-13780791 ]
Hudson commented on HIVE-5357:
------------------------------

ABORTED: Integrated in Hive-trunk-hadoop2 #461 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/461/])
HIVE-5357 : ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY (Chun Chen via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1526990)
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
* /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out

> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5357
>                 URL: https://issues.apache.org/jira/browse/HIVE-5357
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.11.0
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>            Priority: Blocker
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5357.patch
>
>
> Example:
> {code}
> select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
> //result
> 0 0 NULL
> 10 10 NULL
> 100 100 NULL
> 103 103 NULL
> 104 104 NULL
> {code}
> Obviously the result is wrong.
> When we have a simple group by query with a distinct column
> {code}
> explain select count(distinct value) from src group by key;
> {code}
> The plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         src
>           TableScan
>             alias: src
>             Select Operator
>               expressions:
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>               outputColumnNames: key, value
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col1:0._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The map side GBY also adds the distinct columns (value in this case) to its
> key columns.
> When RSDedup optimizes a query involving a GBY with distinct keys, if
> map-side aggregation is enabled, currently it assigns the map-side GBY's key
> columns to the reduce-side GBY. So, for the example shown at the beginning,
> after we generate a plan with a single MR job, the second GBY in the
> reduce-side uses both key and value as its key columns.
> The correct key column is key.
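
The key-selection problem described in the issue can be illustrated with a small toy model in Python. This is only a sketch: the function names (map_side_keys, reduce_side_keys, count_distinct) and the sample rows are hypothetical, it is not Hive code and not the HIVE-5357 patch, and it does not reproduce the exact wrong output shown in the example. It assumes, as in the plan in the description, that the map-side GBY lists the grouping columns first and appends the distinct columns after them.

```python
from collections import defaultdict


def map_side_keys(group_keys, distinct_cols):
    """Map-side GBY keys: grouping columns, then any distinct columns not already present."""
    return list(group_keys) + [c for c in distinct_cols if c not in group_keys]


def reduce_side_keys(map_keys, num_group_keys):
    """Reduce-side GBY keys: only the leading grouping columns of the map-side keys."""
    return map_keys[:num_group_keys]


def count_distinct(rows, group_keys, distinct_col):
    """Toy reduce-side GBY: count distinct values of distinct_col per group."""
    seen = defaultdict(set)
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        seen[group].add(row[distinct_col])
    return {group: len(values) for group, values in seen.items()}


# Made-up rows, not taken from the Hive src table.
rows = [
    {"key": "0", "value": "val_0"},
    {"key": "0", "value": "val_0"},
    {"key": "10", "value": "val_10"},
    {"key": "10", "value": "val_10b"},
]

m_keys = map_side_keys(["key"], ["value"])  # grouping column plus distinct column
r_keys = reduce_side_keys(m_keys, 1)        # grouping column only

# Grouping by the reduce-side keys gives one row per key.
correct = count_distinct(rows, r_keys, "value")
# Reusing the map-side keys on the reduce side splits each key into one group per value.
wrong = count_distinct(rows, m_keys, "value")

print(correct)
print(wrong)
```

In this toy model, grouping by the map-side keys yields one group per (key, value) pair instead of one per key, which is the kind of mismatch the description attributes to RSDedup copying the map-side GBY's keys to the reduce-side GBY.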