[jira] [Updated] (HIVE-24579) Incorrect Result For Groupby With Limit

Nemon Lou (Jira) Sun, 03 Jan 2021 22:18:38 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nemon Lou updated HIVE-24579:
-----------------------------
    Description: 
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}

There is an unexpected TopN in the map phase, which causes an incorrect result.


{code:sql}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: test
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: id
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Top N Key Operator
                      sort order: +
                      keys: id (type: int)
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      top n: 10
                      Group By Operator
                        aggregations: count()
                        keys: id (type: int)
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          null sort order: a
                          sort order: +
                          Map-reduce partition columns: _col0 (type: int)
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          tag: -1
                          TopN: 10
                          TopN Hash Memory Usage: 0.1
                          value expressions: _col1 (type: bigint)
                          auto parallelism: true
            Execution mode: vectorized
            Path -> Alias:
              file:/user/hive/warehouse/test [test]
            Path -> Partition:
              file:/user/hive/warehouse/test 
                Partition
                  base file name: test
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                    bucket_count -1
                    bucketing_version 2
                    column.name.delimiter ,
                    columns id
                    columns.comments 
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location file:/user/hive/warehouse/test
                    name default.test
                    numFiles 0
                    numRows 0
                    rawDataSize 0
                    serialization.ddl struct test { i32 id}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 0
                    transient_lastDdlTime 1609730190
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                      bucket_count -1
                      bucketing_version 2
                      column.name.delimiter ,
                      columns id
                      columns.comments 
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location file:/user/hive/warehouse/test
                      name default.test
                      numFiles 0
                      numRows 0
                      rawDataSize 0
                      serialization.ddl struct test { i32 id}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      totalSize 0
                      transient_lastDdlTime 1609730190
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.test
                  name: default.test
            Truncated Path -> Alias:
              /test [test]
        Reducer 2 
            Execution mode: vectorized
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002/
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types int:bigint
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.escape.crlf true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.116 seconds, Fetched: 148 row(s)


{code}
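
For illustration only, the Python sketch below shows one way a map-side top-N on the group key could lead to wrong counts. It is a toy model, not Hive code, and the report does not confirm that this is the actual mechanism. The assumptions are: each mapper keeps only its n smallest keys after hash aggregation, rows are routed to reducers by id modulo the reducer count, each reducer applies the limit on its own, and the fetch returns the first n rows it sees. The names map_side and reduce_side and the sample data are hypothetical.

```python
from collections import Counter


def map_side(rows, n):
    """Hash-aggregate one mapper's rows, then keep only the n smallest keys.

    Assumption: this stands in for a map-side top-N on the group-by key.
    """
    partial = Counter(rows)
    kept = sorted(partial)[:n]
    return {k: partial[k] for k in kept}


def reduce_side(partials, num_reducers, n):
    """Merge partial counts per reducer; each reducer emits at most n groups.

    Assumption: keys are routed to reducers by key % num_reducers.
    """
    out = []
    for r in range(num_reducers):
        merged = Counter()
        for p in partials:
            for k, c in p.items():
                if k % num_reducers == r:
                    merged[k] += c
        out.extend((k, merged[k]) for k in sorted(merged)[:n])
    return out


# Hypothetical data: two mappers whose id ranges overlap on 11..20.
mapper_a = list(range(1, 21))   # ids 1..20, once each
mapper_b = list(range(11, 31))  # ids 11..30, once each
n = 10

truth = Counter(mapper_a + mapper_b)  # correct count per id
partials = [map_side(mapper_a, n), map_side(mapper_b, n)]

# One reducer: the global 10 smallest ids survive in every mapper that has them.
single = reduce_side(partials, num_reducers=1, n=n)

# Two reducers: the fetch takes the first n rows across reducer outputs.
fetched = reduce_side(partials, num_reducers=2, n=n)[:n]
wrong = [k for k, c in fetched if c != truth[k]]

print("single reducer correct:", all(c == truth[k] for k, c in single))
print("ids with wrong counts using two reducers:", wrong)
```

Under these assumptions a single reducer returns correct counts, while with two reducers the fetched rows include ids 12, 14, 16, 18 and 20 with a count of 1 instead of 2, because one mapper had already dropped its partial counts for those ids.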






 

  was:
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}

There is an unexpected TopN in the map phase, which causes an incorrect result.


{code:sql}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: test
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: id
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count()
                      keys: id (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        tag: -1
                        value expressions: _col1 (type: bigint)
                        auto parallelism: true
            Execution mode: vectorized
            Path -> Alias:
              file:/user/hive/warehouse/test [test]
            Path -> Partition:
              file:/user/hive/warehouse/test 
                Partition
                  base file name: test
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                    bucket_count -1
                    bucketing_version 2
                    column.name.delimiter ,
                    columns id
                    columns.comments 
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location file:/user/hive/warehouse/test
                    name default.test
                    numFiles 0
                    numRows 0
                    rawDataSize 0
                    serialization.ddl struct test { i32 id}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 0
                    transient_lastDdlTime 1609730190
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                      bucket_count -1
                      bucketing_version 2
                      column.name.delimiter ,
                      columns id
                      columns.comments 
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location file:/user/hive/warehouse/test
                      name default.test
                      numFiles 0
                      numRows 0
                      rawDataSize 0
                      serialization.ddl struct test { i32 id}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      totalSize 0
                      transient_lastDdlTime 1609730190
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.test
                  name: default.test
            Truncated Path -> Alias:
              /test [test]
        Reducer 2 
            Execution mode: vectorized
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002/
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types int:bigint
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.escape.crlf true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.111 seconds, Fetched: 141 row(s)

{code}






 


> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Critical
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes an incorrect result.
> {code:sql}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Top N Key Operator
>                       sort order: +
>                       keys: id (type: int)
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                       top n: 10
>                       Group By Operator
>                         aggregations: count()
>                         keys: id (type: int)
>                         mode: hash
>                         outputColumnNames: _col0, _col1
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: int)
>                           null sort order: a
>                           sort order: +
>                           Map-reduce partition columns: _col0 (type: int)
>                           Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                           tag: -1
>                           TopN: 10
>                           TopN Hash Memory Usage: 0.1
>                           value expressions: _col1 (type: bigint)
>                           auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test 
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments 
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments 
>                       columns.types int
>                       file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2 
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002/
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.116 seconds, Fetched: 148 row(s)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
