Hello again. I have now checked out the latest code from trunk and built it as per the instructions.
However, this query:

select a.Symbol, count(*)
from (select Symbol, catid from cat group by Symbol, catid) a
group by a.Symbol;

still returns an incorrect number of rows for the table:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int,
  INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
  ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
  SRCREPID int, TIMESTAMP timestamp)
PARTITIONED BY (SYMBOL string, REPID int)
row format delimited fields terminated by ',' stored as ORC;

Here is the result of EXPLAIN:

hive> EXPLAIN select a.Symbol, count(*)
    > from (select Symbol, catid from cat group by Symbol, catid) a
    > group by a.Symbol;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) Symbol))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:cat
          TableScan
            alias: cat
            Select Operator
              expressions:
                    expr: symbol
                    type: string
                    expr: catid
                    type: bigint
              outputColumnNames: symbol, catid
              Group By Operator
                bucketGroup: false
                keys:
                      expr: symbol
                      type: string
                      expr: catid
                      type: bigint
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: bigint
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
            outputColumnNames: _col0
            Group By Operator
              aggregations:
                    expr: count()
              bucketGroup: false
              keys:
                    expr: _col0
                    type: string
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

With set hive.optimize.reducededuplication=false; I get 2 MapReduce jobs and the correct number of rows (24).

Can I verify somehow, maybe by looking in the source code, that I indeed have the correct version? Or execute a command from the hive CLI that shows the version, etc.? I just built from source this morning, so it seems strange that the bug would still persist :(
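To make the failure mode concrete, here is a minimal simulation in Python (an illustration, not Hive code, using a made-up deterministic partition function): when ReduceSinkDeduplication collapses the plan to a single job, the one shuffle partitions on (symbol, catid), so rows for the same symbol can land on different reducers, and the final group by symbol then runs independently per reducer.

```python
from collections import defaultdict

def shuffle(rows, key):
    """Partition rows across 2 'reducers' using a deterministic toy hash
    of the key columns (a stand-in for Hive's hash partitioning)."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[sum(map(ord, str(key(row)))) % 2].append(row)
    return list(reducers.values())

# Toy data: 2 symbols, 2 distinct catids each (with some duplicate rows).
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 3), ("B", 3)]

# Buggy single-job plan: ONE shuffle, partitioned on (symbol, catid),
# with both group-bys evaluated inside each reducer.
buggy = []
for part in shuffle(rows, key=lambda r: (r[0], r[1])):
    distinct = {(s, c) for s, c in part}   # group by symbol, catid
    counts = defaultdict(int)
    for s, _ in distinct:                  # group by symbol (per reducer!)
        counts[s] += 1
    buggy.extend(counts.items())

# Correct plan: a SECOND shuffle, partitioned on symbol alone,
# before the outer group-by.
stage1 = [pair for part in shuffle(rows, key=lambda r: (r[0], r[1]))
          for pair in {(s, c) for s, c in part}]
correct = []
for part in shuffle(stage1, key=lambda r: r[0]):
    counts = defaultdict(int)
    for s, _ in part:
        counts[s] += 1
    correct.extend(counts.items())

print(sorted(buggy))    # [('A', 1), ('A', 1), ('B', 2)] -- 'A' appears twice
print(sorted(correct))  # [('A', 2), ('B', 2)] -- one row per symbol
```

The extra rows in the buggy output correspond to the duplicated symbols seen in the real query results (48 or 864 rows instead of 24).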
________________________________
From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Tuesday, 17 September 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by default. The original plan tries to shuffle your data twice. Then, ReduceSinkDeduplication finds that the original plan can be optimized to shuffle your data once. But when picking the partitioning columns, this optimizer picked the wrong columns because of the bug.

Also, you can try your query with and without ReduceSinkDeduplication (use "set hive.optimize.reducededuplication=false;" to turn this optimization off) and compare the performance. If the cardinality of "a.Symbol" limits the degree of parallelism, two MapReduce jobs may be faster.

Thanks,
Yin

On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:
> Thank you for the information. Just to be clear: it is not that I have manually restricted the job to run using only a single MapReduce job, but that Hive incorrectly assumes one job is enough?
>
> I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday.
>
> / Sincerely Mikael
>
> From: Yin Huai <huaiyin....@gmail.com>
> To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
> Sent: Monday, 16 September 2013 19:52
> Subject: Re: Duplicate rows when using group by in subquery
>
> Hello Mikael,
>
> Your case seems related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149. Basically, when Hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are used to partition the data, and thus rows with the same value of "c.Symbol" are not correctly grouped. In your case, only "c.Symbol" should be used if we want to use a single MapReduce job to evaluate this query. Can you check the query plan (the results of "EXPLAIN") and see if both "c.Symbol" and "c.catid" appear in the partitioning columns? Or you can attach your query plan.
>
> This bug has been fixed in trunk. Can you test your query in trunk and see if the result is correct? If you are using Hive 0.11, you can also apply the 0.11 patch (https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch).
>
> Thanks,
>
> Yin
>
> On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:
>> Hello.
>>
>> This is basically the same question I posted on Stack Overflow:
>> http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115
>>
>> I know the query is a bit noisy. But this query also demonstrates the error:
>>
>> select a.symbol from (select symbol, ordertype from cat group by symbol, ordertype) a group by a.symbol;
>>
>> Now, this query may not make much sense, but because I have 24 symbols, I expect a result of 24 rows. Instead I get 48 rows back. A similar query:
>>
>> select a.Symbol, count(*) from (select c.Symbol, c.catid from cat as c group by c.Symbol, c.catid) a group by a.Symbol;
>>
>> returns 864 rows, where I still expect 24 rows... If there are alternatives as to how to write the original query in my SO post, I would much appreciate hearing them. The examples in this mail are just there to demonstrate the problem using easier-to-understand queries, and I don't need advice on them.
>>
>> The .csv data and example are from a toy example. My real setup is 6 nodes, and the table definition is:
>>
>> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> set hive.exec.max.dynamic.partitions.pernode=1000;
>>
>> insert...
>>
>> Thank you so much for any input.
>>
>> / Sincerely Mikael
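[Editor's note on the alternatives asked about above: the inner "group by Symbol, catid" followed by an outer count per Symbol is logically a distinct count of catid per Symbol, so a rewrite along the lines of "select Symbol, count(distinct catid) from cat group by Symbol;" might sidestep the buggy plan; this is an untested suggestion, not something confirmed in the thread. The equivalence of the two shapes, sketched in Python:]

```python
from collections import defaultdict

# Toy rows of (symbol, catid), with duplicates.
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 3), ("B", 3)]

# Original shape: inner GROUP BY symbol, catid; outer GROUP BY symbol with count(*).
inner = {(s, c) for s, c in rows}
outer = defaultdict(int)
for s, _ in inner:
    outer[s] += 1

# Rewrite: count(distinct catid) per symbol, in one pass.
distinct = defaultdict(set)
for s, c in rows:
    distinct[s].add(c)
rewrite = {s: len(cs) for s, cs in distinct.items()}

assert dict(outer) == rewrite  # both shapes agree
print(sorted(rewrite.items()))  # [('A', 2), ('B', 2)]
```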