[ https://issues.apache.org/jira/browse/HIVE-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589193#comment-14589193 ]
Chengbing Liu commented on HIVE-7261:
-------------------------------------

HIVE-10971 solved the same problem; marking this as a duplicate.

> Calculation works wrong when hive.groupby.skewindata is true and count(*) count(distinct) group by work simultaneously
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7261
>                 URL: https://issues.apache.org/jira/browse/HIVE-7261
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>         Environment: hive0.12 hadoop1.0.4
>            Reporter: Chris Chen
>
> 【Phenomenon】
> The query results are not the same when hive.groupby.skewindata is set to true as when it is set to false.
>
> 【My question】
> I want to calculate count(*) and count(distinct) simultaneously; otherwise it costs two MR jobs. But when I set hive.groupby.skewindata to true, the count(*) result should not be the same as the count(distinct) result, yet the real result is the same, so it is wrong. I also found that the query plans differ: "Reduce Operator Tree -> Group By Operator -> mode" is mergepartial when skew is set to false and complete when skew is set to true. So I am confused about the root cause of the error.
>
> 【SQL】
> select ds, appid, eventname, active, {color:red}count(distinct(guid)), count(*){color}
> from eventinfo_tmp
> where ds = '20140612' and length(eventname) < 1000 and eventname like '%alibaba%'
> group by ds, appid, eventname, active;
>
> 【Other Hive configuration (excluding hive.groupby.skewindata)】
> hive.exec.compress.output=true
> hive.exec.compress.intermediate=true
> io.seqfile.compression.type=BLOCK
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> hive.map.aggr=true
> hive.stats.autogather=false
> hive.exec.scratchdir=/user/complat/tmp
> mapred.job.queue.name=complat
> hive.exec.mode.local.auto=false
> hive.exec.mode.local.auto.inputbytes.max=500
> hive.exec.mode.local.auto.tasks.max=10
> hive.exec.mode.local.auto.input.files.max=1000
> hive.exec.dynamic.partition=true
> hive.exec.dynamic.partition.mode=nonstrict
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> mapred.max.split.size=100000000
> mapred.min.split.size.per.node=100000000
> mapred.min.split.size.per.rack=100000000
>
> 【Result】
> When hive.groupby.skewindata=true the result is:
> 20140612 8 alibaba 1 {color:red}87 147{color}
> When hive.groupby.skewindata=false the result is:
> 20140612 8 alibaba 1 {color:red}87 87{color}
>
> 【Query plan】
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL active))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         eventinfo_tmp
>           TableScan
>             alias: eventinfo_tmp
>             Filter Operator
>               predicate:
>                   expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: ds
>                       type: string
>                       expr: appid
>                       type: string
>                       expr: eventname
>                       type: string
>                       expr: active
>                       type: int
>                       expr: guid
>                       type: string
>                 outputColumnNames: ds, appid, eventname, active, guid
>                 Group By Operator
>                   aggregations:
>                         expr: count(DISTINCT guid)
>                         expr: count()
>                   bucketGroup: false
>                   keys:
>                         expr: ds
>                         type: string
>                         expr: appid
>                         type: string
>                         expr: eventname
>                         type: string
>                         expr: active
>                         type: int
>                         expr: guid
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                           expr: _col4
>                           type: string
>                     sort order: +++++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                     tag: -1
>                     value expressions:
>                           expr: _col5
>                           type: bigint
>                           expr: _col6
>                           type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col4:0._col0)
>                 expr: count(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: string
>                 expr: KEY._col2
>                 type: string
>                 expr: KEY._col3
>                 type: int
>           mode: {color:red}mergepartial{color}
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col3
>                   type: int
>                   expr: _col4
>                   type: bigint
>                   expr: _col5
>                   type: bigint
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL active))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         eventinfo_tmp
>           TableScan
>             alias: eventinfo_tmp
>             Filter Operator
>               predicate:
>                   expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: ds
>                       type: string
>                       expr: appid
>                       type: string
>                       expr: eventname
>                       type: string
>                       expr: active
>                       type: int
>                       expr: guid
>                       type: string
>                 outputColumnNames: ds, appid, eventname, active, guid
>                 Group By Operator
>                   aggregations:
>                         expr: count(DISTINCT guid)
>                         expr: count()
>                   bucketGroup: false
>                   keys:
>                         expr: ds
>                         type: string
>                         expr: appid
>                         type: string
>                         expr: eventname
>                         type: string
>                         expr: active
>                         type: int
>                         expr: guid
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                           expr: _col4
>                           type: string
>                     sort order: +++++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                     tag: -1
>                     value expressions:
>                           expr: _col5
>                           type: bigint
>                           expr: _col6
>                           type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col4:0._col0)
>                 expr: count(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: string
>                 expr: KEY._col2
>                 type: string
>                 expr: KEY._col3
>                 type: int
>           mode: {color:red}complete{color}
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col3
>                   type: int
>                   expr: _col4
>                   type: bigint
>                   expr: _col5
>                   type: bigint
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
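For anyone trying to reproduce the quoted report, a minimal sketch follows. The table and column names come from the reporter's query; the commented-out DDL is an assumption inferred from the column types shown in the plans (the real table definition is not given in the issue). Only the two SET statements differ between the runs being compared.

{code:sql}
-- Hypothetical DDL (an assumption, derived from the plan's column types):
-- CREATE TABLE eventinfo_tmp (appid STRING, eventname STRING,
--   active INT, guid STRING) PARTITIONED BY (ds STRING);

SET hive.map.aggr=true;            -- as in the reporter's configuration
SET hive.groupby.skewindata=true;  -- flip to false and compare the output

-- The reporter's query: a DISTINCT and a non-DISTINCT aggregate
-- computed in a single GROUP BY.
SELECT ds, appid, eventname, active,
       count(DISTINCT guid),       -- distinct guids per group
       count(*)                    -- all rows per group
FROM eventinfo_tmp
WHERE ds = '20140612'
  AND length(eventname) < 1000
  AND eventname LIKE '%alibaba%'
GROUP BY ds, appid, eventname, active;
{code}

Comparing EXPLAIN output for the two runs shows the difference the reporter highlights: the final Group By Operator runs in mergepartial mode in one plan and complete mode in the other.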
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
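As a side note, until the fix from HIVE-10971 is picked up, the standard two-stage rewrite below avoids mixing DISTINCT and non-DISTINCT aggregates in one GROUP BY. This is a generic sketch, not something proposed in this issue; it still costs the second MR job the reporter wanted to avoid, but it should return correct counts under either skew setting.

{code:sql}
-- Stage 1 (inner query): collapse to one row per (group key, guid),
--   keeping the raw row count per guid.
-- Stage 2 (outer query): count the collapsed rows (= distinct guids)
--   and sum the raw counts (= total rows).
SELECT ds, appid, eventname, active,
       count(*) AS distinct_guids,  -- one inner row per distinct guid
       sum(cnt) AS total_rows       -- equals count(*) on the base table
FROM (
  SELECT ds, appid, eventname, active, guid, count(*) AS cnt
  FROM eventinfo_tmp
  WHERE ds = '20140612'
    AND length(eventname) < 1000
    AND eventname LIKE '%alibaba%'
  GROUP BY ds, appid, eventname, active, guid
) t
GROUP BY ds, appid, eventname, active;
{code}

Because each query level now contains only one kind of aggregate, the skew-handling rewrite has no DISTINCT and non-DISTINCT counters to conflate.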