Calculation works wrong when hive.groupby.skewindata is true and count(*) count(distinct) group by work simultaneously

Chris Chen Mon, 30 Jun 2014 22:45:27 -0700

 

【Phenomenon】
The query returns different results depending on whether
hive.groupby.skewindata is set to true or false.


【my question】
I want to calculate count(*) and count(distinct) in the same query;
otherwise it costs two MR jobs. The count(*) result should not be the same
as the count(distinct) result, but when I set hive.groupby.skewindata to
true the two results are the same, so it is wrong. Comparing the query
plans, I found one difference: "Reduce Operator Tree->Group By
Operator->mode" is mergepartial when skewindata is set to false and
complete when it is set to true. So I am confused about the root cause of
the error.
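
One possible reading of the plan difference, offered as a hypothesis and not confirmed anywhere in this thread: the map side already aggregates in hash mode on (group keys, guid) and emits a partial count per key, so the reducer has to sum those partial counts. If the reducer instead treats each incoming partial row as a raw row and counts it, count(*) collapses to the number of (group, guid) rows, which equals count(distinct guid) when there is a single mapper. The sketch below simulates that in plain Python. The function names and the sample rows are made up for illustration and are not Hive APIs.

```python
from collections import defaultdict

def map_side_hash_aggregate(rows):
    """Simulate a map-side Group By in 'hash' mode.

    Keys are (group, guid), mirroring the plan where guid is appended to the
    group-by keys; the value is a partial count(*) for that key.
    """
    partial = defaultdict(int)
    for group, guid in rows:
        partial[(group, guid)] += 1
    return [(group, guid, cnt) for (group, guid), cnt in partial.items()]

def reduce_side(partials, merge_partial_counts):
    """Simulate the reduce-side Group By over the mappers' output.

    merge_partial_counts=True  -> sum the partial counts (merging)
    merge_partial_counts=False -> count incoming rows, i.e. treat every
                                  partial row as if it were a raw input row
    """
    distinct = defaultdict(set)
    total = defaultdict(int)
    for group, guid, cnt in partials:
        distinct[group].add(guid)
        total[group] += cnt if merge_partial_counts else 1
    return {g: (len(distinct[g]), total[g]) for g in distinct}

# Hypothetical data: one group, 3 distinct guids, 5 rows, one mapper.
rows = [("g", "a"), ("g", "a"), ("g", "b"), ("g", "c"), ("g", "c")]
partials = map_side_hash_aggregate(rows)

merged = reduce_side(partials, merge_partial_counts=True)
iterated = reduce_side(partials, merge_partial_counts=False)
print(merged)    # {'g': (3, 5)}  -> (count(distinct), count(*))
print(iterated)  # {'g': (3, 3)}  -> count(*) collapses to the distinct count
```

Whether Hive's complete mode actually behaves like the second call for this plan is an assumption here; the sketch only shows why the reduce-side mode would matter once the map side has pre-aggregated.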

【sql】
select ds,appid,eventname,active,count(distinct(guid)), count(*) from
eventinfo_tmp where ds='20140612' and length(eventname)<1000 and eventname
like '%alibaba%' group by ds,appid,eventname,active;

【other Hive configuration, excluding hive.groupby.skewindata】
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
io.seqfile.compression.type=BLOCK
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hive.map.aggr=true
hive.stats.autogather=false
hive.exec.scratchdir=/user/complat/tmp
mapred.job.queue.name=complat
hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=500
hive.exec.mode.local.auto.tasks.max=10
hive.exec.mode.local.auto.input.files.max=1000
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
mapred.max.split.size=100000000
mapred.min.split.size.per.node=100000000
mapred.min.split.size.per.rack=100000000

【result】
When hive.groupby.skewindata=true, the result is:
20140612 8 alibaba 1 87 147

When hive.groupby.skewindata=false, the result is:
20140612 8 alibaba 1 87 87

【query plan】
Plan with reduce-side mode mergepartial (hive.groupby.skewindata=false, as
described in the question):
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds)
'20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000))
(like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY
(TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname)
(TOK_TABLE_OR_COL active))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
eventinfo_tmp 
TableScan
alias: eventinfo_tmp
Filter Operator
predicate:
expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
type: boolean
Select Operator
expressions:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
outputColumnNames: ds, appid, eventname, active, guid
Group By Operator
aggregations:
expr: count(DISTINCT guid)
expr: count()
bucketGroup: false
keys:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: string
sort order: +++++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
tag: -1
value expressions:
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col4:0._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
expr: KEY._col2
type: string
expr: KEY._col3
type: int
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: bigint
expr: _col5
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1

Plan with reduce-side mode complete (hive.groupby.skewindata=true, as
described in the question):
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds)
'20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000))
(like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY
(TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname)
(TOK_TABLE_OR_COL active))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
eventinfo_tmp 
TableScan
alias: eventinfo_tmp
Filter Operator
predicate:
expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
type: boolean
Select Operator
expressions:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
outputColumnNames: ds, appid, eventname, active, guid
Group By Operator
aggregations:
expr: count(DISTINCT guid)
expr: count()
bucketGroup: false
keys:
expr: ds
type: string
expr: appid
type: string
expr: eventname
type: string
expr: active
type: int
expr: guid
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: string
sort order: +++++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
tag: -1
value expressions:
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col4:0._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
expr: KEY._col2
type: string
expr: KEY._col3
type: int
mode: complete
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: int
expr: _col4
type: bigint
expr: _col5
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1
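
The two plans are long and nearly identical, so a mechanical diff is an easy way to confirm where they differ. Below is a minimal sketch using Python's standard difflib on excerpts of the reduce-side Group By Operator copied from the plans above. The variable and function names are illustrative only; in practice the full EXPLAIN output of each run would be compared.

```python
import difflib

# Excerpts of the reduce-side Group By Operator from the two plans.
plan_mergepartial = """\
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col4:0._col0)
expr: count(VALUE._col1)
bucketGroup: false
mode: mergepartial
""".splitlines()

plan_complete = """\
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col4:0._col0)
expr: count(VALUE._col1)
bucketGroup: false
mode: complete
""".splitlines()

def changed_lines(a, b):
    """Return only the removed ('- ') and added ('+ ') lines between a and b."""
    return [line for line in difflib.ndiff(a, b)
            if line.startswith(("- ", "+ "))]

changes = changed_lines(plan_mergepartial, plan_complete)
for line in changes:
    print(line)
```

For these excerpts the only differing line is the mode line, which matches the difference described in the question.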
