[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481 ]
Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:11 AM:
------------------------------------------------------------

Thanks [~kkasa] for your attention.
This issue only happens on a customer's cluster, and I could not get the data.
The simplified reproduction steps do not seem to match the customer's issue.

Here is the original issue (with the table name changed):
The query result for the same store_id differs when limit 10 is changed to limit 100.
{code:sql}
SELECT store_id store_id_hive ,
       count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
  AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') as int)
  AND nvl(is_curr_bound,1) = 1
group by store_id
limit 10;
{code}
Query plan:
{code:sql}
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| |
| Stage-0 |
|   Fetch Operator |
|     limit:10 |
|     Stage-1 |
|       Reducer 2 |
|       File Output Operator [FS_8] |
|         Limit [LIM_7] (rows=10 width=39) |
|           Number of rows:10 |
|           Group By Operator [GBY_5] (rows=5618832 width=39) |
|             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|           <-Map 1 [SIMPLE_EDGE] |
|             SHUFFLE [RS_4] |
|               PartitionCols:_col0 |
|               Group By Operator [GBY_3] (rows=11237665 width=39) |
|                 Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
|                 Select Operator [SEL_2] (rows=11237665 width=39) |
|                   Output:["store_id"] |
|                   Filter Operator [FIL_9] (rows=11237665 width=39) |
|                     predicate:(NVL(is_curr_bound,1) = 1) |
|                     TableScan [TS_0] (rows=22475330 width=39) |
|                       db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] |
{code}
Part of the extended plan:
{code:sql}
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
|   Stage-1 is a root stage |
|   Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
|   Stage: Stage-1 |
|     Tez |
|       DagId: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|       Edges: |
|         Reducer 2 <- Map 1 (SIMPLE_EDGE) |
|       DagName: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|       Vertices: |
|         Map 1 |
|             Map Operator Tree: |
|                 TableScan |
|                   alias: table_name |
|                   Statistics: Num rows: 22475330 Data size: 876537870 Basic stats: COMPLETE Column stats: NONE |
|                   GatherStats: false |
|                   Filter Operator |
|                     isSamplingPred: false |
|                     predicate: (NVL(is_curr_bound,1) = 1) (type: boolean) |
|                     Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
|                     Select Operator |
|                       expressions: store_id (type: string) |
|                       outputColumnNames: store_id |
|                       Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
|                       Group By Operator |
|                         aggregations: count() |
|                         keys: store_id (type: string) |
|                         mode: hash |
|                         outputColumnNames: _col0, _col1 |
|                         Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
|                         Reduce Output Operator |
|                           key expressions: _col0 (type: string) |
|                           null sort order: a |
|                           sort order: + |
|                           Map-reduce partition columns: _col0 (type: string) |
|                           Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE |
|                           tag: -1 |
|                           TopN: 10 |
|                           TopN Hash Memory Usage: 0.1 |
|                           value expressions: _col1 (type: bigint) |
|                           auto parallelism: true |
{code}


> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.7, 3.1.2, 4.0.0
> Reporter: Nemon Lou
> Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: count()
>                       keys: id (type: int)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                         tag: -1
>                         TopN: 10
>                         TopN Hash Memory Usage: 0.1
>                         value expressions: _col1 (type: bigint)
>                         auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments
>                       columns.types int
>                       file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
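
For readers following the `TopN: 10` entry in the Reduce Output Operator of the plans above, here is a toy model of what a key-based top-N on the map side would be expected to do. It is only a sketch under stated assumptions, not Hive's implementation: a single reducer, keys kept in ascending order (matching `sort order: +`), and hypothetical names (`map_side_topn`, `reduce_side`, `group_by_with_limit`) and input data. In this model every key that survives the final limit has all of its rows forwarded by every mapper, so its count does not depend on the limit. That is the invariant the reported results (different counts for the same store_id under limit 10 and limit 100) appear to violate.

```python
import heapq
from collections import Counter


def map_side_topn(rows, n):
    """Hash-aggregate one mapper's keys, then forward only the n smallest keys."""
    partial = Counter(rows)             # count() grouped by key
    kept = heapq.nsmallest(n, partial)  # iterating a Counter yields its keys
    return {key: partial[key] for key in kept}


def reduce_side(partials, n):
    """Merge the partial counts from all mappers and apply the final limit."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)          # Counter.update adds counts together
    return {key: merged[key] for key in sorted(merged)[:n]}


def group_by_with_limit(mappers, n):
    """Toy equivalent of: select key, count(*) ... group by key limit n."""
    return reduce_side([map_side_topn(rows, n) for rows in mappers], n)


# Hypothetical input: three mappers, each holding some store_id values.
mappers = [
    ["s03", "s01", "s02", "s01"],
    ["s02", "s04", "s01"],
    ["s05", "s03", "s01", "s02"],
]

small = group_by_with_limit(mappers, 2)
large = group_by_with_limit(mappers, 4)

# Every key returned under the smaller limit has the same count under the larger one.
assert all(large[key] == count for key, count in small.items())
print(small)
```

The model does not reproduce the bug; it only states the expected behaviour that a real run can be checked against, for example by comparing the counts returned for one store_id under both limits.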