Hello again. I have now checked out the latest code from trunk and built it as per the instructions.
However, this query:

select a.Symbol, count(*)
from (select Symbol, catid from cat group by Symbol, catid) a
group by a.Symbol;

still returns an incorrect number of rows for the table:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int,
  INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int,
  ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string,
  SRCREPID int, TIMESTAMP timestamp)
PARTITIONED BY (SYMBOL string, REPID int)
row format delimited fields terminated by ',' stored as ORC;

Here is the result of EXPLAIN:

hive> EXPLAIN select a.Symbol, count(*)
    > from (select Symbol, catid from cat group by Symbol, catid) a
    > group by a.Symbol;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) Symbol))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:cat
          TableScan
            alias: cat
            Select Operator
              expressions:
                    expr: symbol
                    type: string
                    expr: catid
                    type: bigint
              outputColumnNames: symbol, catid
              Group By Operator
                bucketGroup: false
                keys:
                      expr: symbol
                      type: string
                      expr: catid
                      type: bigint
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: bigint
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
            outputColumnNames: _col0
            Group By Operator
              aggregations:
                    expr: count()
              bucketGroup: false
              keys:
                    expr: _col0
                    type: string
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

With set hive.optimize.reducededuplication=false; I get 2 MapReduce jobs and the correct number of rows (24).

Can I verify somehow, maybe by looking in the source code, that I indeed have the correct version? Or execute a command from the hive CLI that shows the version, etc.? I just built from source this morning, so it seems strange that the bug would still persist :(
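To make the failure mode concrete, here is a minimal simulation in Python (an illustration, not Hive code, using a made-up deterministic partition function): when ReduceSinkDeduplication collapses the plan to a single job, the one shuffle partitions on (symbol, catid), so rows for the same symbol can land on different reducers, and the final group by symbol then runs independently per reducer.

```python
from collections import defaultdict

def shuffle(rows, key):
    """Partition rows across 2 'reducers' using a deterministic toy hash
    of the key columns (a stand-in for Hive's hash partitioning)."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[sum(map(ord, str(key(row)))) % 2].append(row)
    return list(reducers.values())

# Toy data: 2 symbols, 2 distinct catids each (with some duplicate rows).
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 3), ("B", 3)]

# Buggy single-job plan: ONE shuffle, partitioned on (symbol, catid),
# with both group-bys evaluated inside each reducer.
buggy = []
for part in shuffle(rows, key=lambda r: (r[0], r[1])):
    distinct = {(s, c) for s, c in part}   # group by symbol, catid
    counts = defaultdict(int)
    for s, _ in distinct:                  # group by symbol (per reducer!)
        counts[s] += 1
    buggy.extend(counts.items())

# Correct plan: a SECOND shuffle, partitioned on symbol alone,
# before the outer group-by.
stage1 = [pair for part in shuffle(rows, key=lambda r: (r[0], r[1]))
          for pair in {(s, c) for s, c in part}]
correct = []
for part in shuffle(stage1, key=lambda r: r[0]):
    counts = defaultdict(int)
    for s, _ in part:
        counts[s] += 1
    correct.extend(counts.items())

print(sorted(buggy))    # [('A', 1), ('A', 1), ('B', 2)] -- 'A' appears twice
print(sorted(correct))  # [('A', 2), ('B', 2)] -- one row per symbol
```

The extra rows in the buggy output correspond to the duplicated symbols seen in the real query results (48 or 864 rows instead of 24).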
________________________________
From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Tuesday, 17 September 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by default. The original plan tries to shuffle your data twice. Then, ReduceSinkDeduplication finds that the original plan can be optimized to shuffle your data once. But when picking the partitioning columns, this optimizer picked the wrong columns because of the bug.

Also, you can try your query with and without ReduceSinkDeduplication (use "set hive.optimize.reducededuplication=false;" to turn this optimization off) and compare the performance. If the cardinality of "a.Symbol" limits the degree of parallelism, two MapReduce jobs may be faster.

Thanks,
Yin

On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:
> Thank you for the information. Just to be clear: it is not that I have manually restricted the job to run using only a single MapReduce job, but that Hive incorrectly assumes one job is enough?
>
> I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday.
>
> / Sincerely Mikael
>
> From: Yin Huai <huaiyin....@gmail.com>
> To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
> Sent: Monday, 16 September 2013 19:52
> Subject: Re: Duplicate rows when using group by in subquery
>
> Hello Mikael,
>
> Your case seems related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149. Basically, when Hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are used to partition the data, and thus rows with the same value of "c.Symbol" are not correctly grouped. In your case, only "c.Symbol" should be used if we want to use a single MapReduce job to evaluate this query. Can you check the query plan (the results of "EXPLAIN") and see if both "c.Symbol" and "c.catid" appear in the partitioning columns? Or you can attach your query plan.
>
> This bug has been fixed in trunk. Can you test your query in trunk and see if the result is correct? If you are using Hive 0.11, you can also apply the 0.11 patch (https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch).
>
> Thanks,
>
> Yin
>
> On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:
>> Hello.
>>
>> This is basically the same question I posted on Stack Overflow:
>> http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115
>>
>> I know the query is a bit noisy. But this query also demonstrates the error:
>>
>> select a.symbol from (select symbol, ordertype from cat group by symbol, ordertype) a group by a.symbol;
>>
>> Now, this query may not make much sense, but because I have 24 symbols, I expect a result of 24 rows. Instead I get 48 rows back. A similar query:
>>
>> select a.Symbol, count(*) from (select c.Symbol, c.catid from cat as c group by c.Symbol, c.catid) a group by a.Symbol;
>>
>> returns 864 rows, where I still expect 24 rows... If there are alternatives as to how to write the original query in my SO post, I would much appreciate hearing them. The examples in this mail are just there to demonstrate the problem using easier-to-understand queries, and I don't need advice on them.
>>
>> The .csv data and example are from a toy example. My real setup is 6 nodes, and the table definition is:
>>
>> create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> set hive.exec.max.dynamic.partitions.pernode=1000;
>>
>> insert...
>>
>> Thank you so much for any input.
>>
>> / Sincerely Mikael
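[Editor's note on the alternatives asked about above: the inner "group by Symbol, catid" followed by an outer count per Symbol is logically a distinct count of catid per Symbol, so a rewrite along the lines of "select Symbol, count(distinct catid) from cat group by Symbol;" might sidestep the buggy plan; this is an untested suggestion, not something confirmed in the thread. The equivalence of the two shapes, sketched in Python:]

```python
from collections import defaultdict

# Toy rows of (symbol, catid), with duplicates.
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 3), ("B", 3)]

# Original shape: inner GROUP BY symbol, catid; outer GROUP BY symbol with count(*).
inner = {(s, c) for s, c in rows}
outer = defaultdict(int)
for s, _ in inner:
    outer[s] += 1

# Rewrite: count(distinct catid) per symbol, in one pass.
distinct = defaultdict(set)
for s, c in rows:
    distinct[s].add(c)
rewrite = {s: len(cs) for s, cs in distinct.items()}

assert dict(outer) == rewrite  # both shapes agree
print(sorted(rewrite.items()))  # [('A', 2), ('B', 2)]
```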