[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413144#comment-17413144
 ] 

Krisztian Kasa commented on HIVE-24579:
---------------------------------------

[~nemon]
The TopN is introduced because the Reduce Sink operator does sorting by the id 
column and there is also a limit 10 defined. So it broadcast only the first 10 
rows to the reducer.

Could you please provide some test data which can be used to reproduce the 
incorrect result you mention in the summary?

> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: count()
>                       keys: id (type: int)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                         tag: -1
>                         TopN: 10
>                         TopN Hash Memory Usage: 0.1
>                         value expressions: _col1 (type: bigint)
>                         auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test 
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments 
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments 
>                       columns.types int
>                       file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2 
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: 
> file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: 
> file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
>                     table:
>                         input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels 
> true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to