[jira] [Commented] (HIVE-5149) ReduceSinkDeDuplication can pick the wrong partitioning columns

Yin Huai (JIRA) Tue, 27 Aug 2013 11:00:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-5149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751502#comment-13751502
 ]


Yin Huai commented on HIVE-5149:
--------------------------------

Another example, in reduce_deduplicate_extended.q, there is 
{code:sql}
explain from (select key, value from src group by key, value) s select s.key 
group by s.key;
{\code}
The plan is 
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        s:src 
          TableScan
            alias: src
            Select Operator
              expressions:
                    expr: key
                    type: string
                    expr: value
                    type: string
              outputColumnNames: key, value
              Group By Operator
                bucketGroup: false
                keys:
                      expr: key
                      type: string
                      expr: value
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
            outputColumnNames: _col0
            Group By Operator
              bucketGroup: false
              keys:
                    expr: _col0
                    type: string
              mode: complete
              outputColumnNames: _col0
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

I think the plan is wrong. We should use key as the partitioning column to make 
sure all rows associated with the same key will be assigned to the same reducer.
                
> ReduceSinkDeDuplication can pick the wrong partitioning columns
> ---------------------------------------------------------------
>
>                 Key: HIVE-5149
>                 URL: https://issues.apache.org/jira/browse/HIVE-5149
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> https://mail-archives.apache.org/mod_mbox/hive-user/201308.mbox/%3CCAG6Lhyex5XPwszpihKqkPRpzri2k=m4qgc+cpar5yvr8sjt...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-5149) ReduceSinkDeDuplication can pick the wrong partitioning columns

Reply via email to