[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nemon Lou updated HIVE-24579:
-----------------------------
    Description:
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}
An unexpected TopN appears in the map phase, which causes incorrect results.
{code:sql}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20210104140946_940cd4ce-8bb5-41ac-91ec-1185245da009:4
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: test
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: id
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Top N Key Operator
                      sort order: +
                      keys: id (type: int)
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      top n: 10
                      Group By Operator
                        aggregations: count()
                        keys: id (type: int)
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          null sort order: a
                          sort order: +
                          Map-reduce partition columns: _col0 (type: int)
                          Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                          tag: -1
                          TopN: 10
                          TopN Hash Memory Usage: 0.1
                          value expressions: _col1 (type: bigint)
                          auto parallelism: true
            Execution mode: vectorized
            Path -> Alias:
              file:/user/hive/warehouse/test [test]
            Path -> Partition:
              file:/user/hive/warehouse/test
                Partition
                  base file name: test
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                    bucket_count -1
                    bucketing_version 2
                    column.name.delimiter ,
                    columns id
                    columns.comments
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location file:/user/hive/warehouse/test
                    name default.test
                    numFiles 0
                    numRows 0
                    rawDataSize 0
                    serialization.ddl struct test { i32 id}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 0
                    transient_lastDdlTime 1609730190
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                      bucket_count -1
                      bucketing_version 2
                      column.name.delimiter ,
                      columns id
                      columns.comments
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location file:/user/hive/warehouse/test
                      name default.test
                      numFiles 0
                      numRows 0
                      rawDataSize 0
                      serialization.ddl struct test { i32 id}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      totalSize 0
                      transient_lastDdlTime 1609730190
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.test
                  name: default.test
            Truncated Path -> Alias:
              /test [test]
        Reducer 2
            Execution mode: vectorized
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-09-46_584_5422661027690569952-1/-mr-10001/.hive-staging_hive_2021-01-04_14-09-46_584_5422661027690569952-1/-ext-10002/
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types int:bigint
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.escape.crlf true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.116 seconds, Fetched: 148 row(s)
{code}

  was:
{code:sql}
create table test(id int);
explain extended select id,count(*) from test group by id limit 10;
{code}
An unexpected TopN appears in the map phase, which causes incorrect results.
{code:sql}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20210104113831_2451d621-8f77-4a29-9da6-3a65bc4d9e56:2
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: test
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: id
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count()
                      keys: id (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                        tag: -1
                        value expressions: _col1 (type: bigint)
                        auto parallelism: true
            Execution mode: vectorized
            Path -> Alias:
              file:/user/hive/warehouse/test [test]
            Path -> Partition:
              file:/user/hive/warehouse/test
                Partition
                  base file name: test
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                    bucket_count -1
                    bucketing_version 2
                    column.name.delimiter ,
                    columns id
                    columns.comments
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location file:/user/hive/warehouse/test
                    name default.test
                    numFiles 0
                    numRows 0
                    rawDataSize 0
                    serialization.ddl struct test { i32 id}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 0
                    transient_lastDdlTime 1609730190
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
                      bucket_count -1
                      bucketing_version 2
                      column.name.delimiter ,
                      columns id
                      columns.comments
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location file:/user/hive/warehouse/test
                      name default.test
                      numFiles 0
                      numRows 0
                      rawDataSize 0
                      serialization.ddl struct test { i32 id}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      totalSize 0
                      transient_lastDdlTime 1609730190
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.test
                  name: default.test
            Truncated Path -> Alias:
              /test [test]
        Reducer 2
            Execution mode: vectorized
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_11-38-31_601_4363062670409846390-1/-mr-10001/.hive-staging_hive_2021-01-04_11-38-31_601_4363062670409846390-1/-ext-10002/
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        properties:
                          columns _col0,_col1
                          columns.types int:bigint
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.escape.crlf true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.111 seconds, Fetched: 141 row(s)
{code}

> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Priority: Critical
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> An unexpected TopN appears in the map phase, which causes incorrect results.
--
This message was sent by Atlassian Jira (v8.3.4#803005)