[ https://issues.apache.org/jira/browse/HIVE-24595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mustafa İman reassigned HIVE-24595: ----------------------------------- Assignee: Mustafa İman > Vectorization causing incorrect results for scalar subquery > ----------------------------------------------------------- > > Key: HIVE-24595 > URL: https://issues.apache.org/jira/browse/HIVE-24595 > Project: Hive > Issue Type: Bug > Components: Vectorization > Affects Versions: 3.0.0 > Reporter: Vineet Garg > Assignee: Mustafa İman > Priority: Major > > *Repro* > {code:sql} > CREATE EXTERNAL TABLE `alltypessmall`( > `id` int, > `bool_col` boolean, > `tinyint_col` tinyint, > `smallint_col` smallint, > `int_col` int, > `bigint_col` bigint, > `float_col` float, > `double_col` double, > `date_string_col` string, > `string_col` string, > `timestamp_col` timestamp) > PARTITIONED BY ( > `year` int, > `month` int) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( > 'escape.delim'='\\', > 'field.delim'=',', > 'serialization.format'=',') > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > TBLPROPERTIES ( > 'DO_NOT_UPDATE_STATS'='true', > 'OBJCAPABILITIES'='EXTREAD,EXTWRITE', > 'STATS_GENERATED'='TASK', > 'impala.lastComputeStatsTime'='1608312793', > 'transient_lastDdlTime'='1608310442') ; > insert into alltypessmall partition(year=2002,month=1) values(1, true, > 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001'); > insert into alltypessmall partition(year=2002,month=1) values(1, true, > 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001'); > insert into alltypessmall partition(year=2002,month=1) values(1, true, > 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001'); > {code} > Following query should fail but it succeeds > {code:sql} > SELECT id FROM alltypessmall > WHERE int_col = > (SELECT int_col > FROM alltypessmall) > ORDER BY id; > {code} > *Explain plan* > {code:java} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17 > Edges: > Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE) > DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: alltypessmall > filterExpr: int_col is not null (type: boolean) > Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE > Column stats: COMPLETE > Filter Operator > predicate: int_col is not null (type: boolean) > Statistics: Num rows: 3 Data size: 24 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: id (type: int), int_col (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 3 Data size: 24 Basic stats: > COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 > 1 > outputColumnNames: _col0, _col1 > input vertices: > 1 Reducer 4 > Statistics: Num rows: 3 Data size: 24 Basic stats: > COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col1 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col0 > input vertices: > 1 Map 3 > Statistics: Num rows: 9 Data size: 36 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: z > sort order: + > Statistics: Num rows: 9 Data size: 36 Basic > stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Map 3 > Map Operator Tree: > TableScan > alias: alltypessmall > Statistics: Num rows: 3 Data size: 142 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > Statistics: Num rows: 3 Data size: 142 Basic stats: > COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: count() > minReductionHashAggr: 0.6666666 > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > null sort order: > sort order: > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: bigint) > Filter Operator > predicate: int_col is not null (type: boolean) > Statistics: Num rows: 3 Data size: 12 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: int_col (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 3 Data size: 12 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: z > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 3 Data size: 12 Basic stats: > COMPLETE Column stats: COMPLETE > Execution mode: vectorized > Reducer 2 > Execution mode: vectorized > Reduce Operator Tree: > Select Operator > expressions: KEY.reducesinkkey0 (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE > Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE > Column stats: COMPLETE > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Reducer 4 > Execution mode: vectorized > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > Filter Operator > predicate: (sq_count_check(_col0) <= 1) (type: boolean) > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > Select Operator > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > null sort order: > sort order: > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected > output in this case is *3* but actual seems to be *1*. > > *CBO plan* > {code:sql} > HiveSortLimit(sort0=[$0], dir0=[ASC]) > HiveProject(id=[$0]) > HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], > cost=[{6.0 rows, 0.0 cpu, 0.0 io}]) > HiveJoin(condition=[true], joinType=[inner], algorithm=[none], > cost=[{4.0 rows, 0.0 cpu, 0.0 io}]) > HiveProject(id=[$0], int_col=[$4]) > HiveFilter(condition=[IS NOT NULL($4)]) > HiveTableScan(table=[[cdpd_20765, alltypessmall]], > table:alias=[alltypessmall]) > HiveProject(cnt=[$0]) > HiveFilter(condition=[<=(sq_count_check($0), 1)]) > HiveProject(cnt=[$0]) > HiveAggregate(group=[{}], cnt=[COUNT()]) > HiveTableScan(table=[[cdpd_20765, alltypessmall]], > table:alias=[alltypessmall]) > HiveProject(int_col=[$4]) > HiveFilter(condition=[IS NOT NULL($4)]) > HiveTableScan(table=[[cdpd_20765, alltypessmall]], > table:alias=[alltypessmall]) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)