[ https://issues.apache.org/jira/browse/HIVE-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873093#comment-13873093 ]
Navis commented on HIVE-6099: ----------------------------- sure {noformat} hive> explain select a.dt,count(distinct a.user) as AllDist,count(distinct case when a.type = 111 then a.user else null end) as Type111User, > count(distinct case when a.type != 111 then a.user else null end) as Type123User from Table_A a > group by a.dt > ; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME Table_A) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count (. (TOK_TABLE_OR_COL a) user)) AllDist) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (= (. (TOK_TABLE_OR_COL a) type) 111) (. (TOK_TABLE_OR_COL a) user) TOK_NULL)) Type111User) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (!= (. (TOK_TABLE_OR_COL a) type) 111) (. (TOK_TABLE_OR_COL a) user) TOK_NULL)) Type123User)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) dt)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Select Operator expressions: expr: dt type: string expr: user type: string expr: type type: int outputColumnNames: dt, user, type Group By Operator aggregations: expr: count(DISTINCT user) expr: count(DISTINCT CASE WHEN ((type = 111)) THEN (user) ELSE (null) END) expr: count(DISTINCT CASE WHEN ((type <> 111)) THEN (user) ELSE (null) END) bucketGroup: false keys: expr: dt type: string expr: user type: string expr: CASE WHEN ((type = 111)) THEN (user) ELSE (null) END type: string expr: CASE WHEN ((type <> 111)) THEN (user) ELSE (null) END type: string mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string expr: _col2 type: string expr: _col3 type: string sort order: ++++ Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col4 type: bigint expr: _col5 type: bigint expr: _col6 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col1:0._col0) expr: count(DISTINCT KEY._col1:1._col0) expr: count(DISTINCT KEY._col1:2._col0) bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3 Select Operator expressions: expr: _col0 type: string expr: _col1 type: bigint expr: _col2 type: bigint expr: _col3 type: bigint outputColumnNames: _col0, _col1, _col2, _col3 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 {noformat} > Multi insert does not work properly with distinct count > ------------------------------------------------------- > > Key: HIVE-6099 > URL: https://issues.apache.org/jira/browse/HIVE-6099 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.9.0, 0.10.0 > Reporter: Pavan Gadam Manohar > Labels: count, distinct, insert, multi-insert > > Need 2 rows to reproduce this Bug. Here are the steps. > Step 1) Create a table Table_A > CREATE EXTERNAL TABLE Table_A > ( > user string > , type int > ) > PARTITIONED BY (dt string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '|' > STORED AS RCFILE > LOCATION '/hive/<path>/Table_A'; > Step 2) Scenario: Lets us say consider user tommy belong to both usertypes > 111 and 123. Insert 2 records into the table created above. > select * from Table_A; > hive> select * from table_a; > OK > tommy 123 2013-12-02 > tommy 111 2013-12-02 > Step 3) Create 2 destination tables to simulate multi-insert. > CREATE EXTERNAL TABLE dest_Table_A > ( > p_date string > , Distinct_Users int > , Type111Users int > , Type123Users int > ) > PARTITIONED BY (dt string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '|' > STORED AS RCFILE > LOCATION '/hive/<path>/dest_Table_A'; > > CREATE EXTERNAL TABLE dest_Table_B > ( > p_date string > , Distinct_Users int > , Type111Users int > , Type123Users int > ) > PARTITIONED BY (dt string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '|' > STORED AS RCFILE > LOCATION '/hive/<path>/dest_Table_B'; > Step 4) Multi insert statement > from Table_A a > INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02') > select a.dt > ,count(distinct a.user) as AllDist > ,count(distinct case when a.type = 111 then a.user else null end) as > Type111User > ,count(distinct case when a.type != 111 then a.user else null end) as > Type123User > group by a.dt > > INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02') > select a.dt > ,count(distinct a.user) as AllDist > ,count(distinct case when a.type = 111 then a.user else null end) as > Type111User > ,count(distinct case when a.type != 111 then a.user else null end) as > Type123User > group by a.dt > ; > > Step 5) Verify results. > hive> select * from dest_table_a; > OK > 2013-12-02 2 1 1 2013-12-02 > Time taken: 0.116 seconds > hive> select * from dest_table_b; > OK > 2013-12-02 2 1 1 2013-12-02 > Time taken: 0.13 seconds > Conclusion: Hive gives a count of 2 for distinct users although there is > only one distinct user. After trying many datasets observed that Hive is > doing Type111Users + Typoe123Users = DistinctUsers which is wrong. > hive> select count(distinct a.user) from table_a a; > Gives: > Total MapReduce CPU Time Spent: 4 seconds 350 msec > OK > 1 -- This message was sent by Atlassian JIRA (v6.1.5#6160)