[ https://issues.apache.org/jira/browse/HIVE-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Kasa reassigned HIVE-21481: ------------------------------------- Assignee: Krisztian Kasa > MERGE correctness issues with null safe equality > ------------------------------------------------ > > Key: HIVE-21481 > URL: https://issues.apache.org/jira/browse/HIVE-21481 > Project: Hive > Issue Type: Bug > Components: Query Planning > Reporter: Vineet Garg > Assignee: Krisztian Kasa > Priority: Major > > The way Hive currently generates plan for MERGE statement can lead to wrong > results with null safe equality. > To illustrate consider the following reproducer > {code:sql} > create table ttarget(s string, j int, flag string) stored as orc > tblproperties("transactional"="true"); > truncate table ttarget; > insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update'); > create table tsource (i int); > insert into tsource values(null),(2); > {code} > Let's say you have the following MERGE statement > {code:sql} > explain merge into ttarget using tsource on i<=>j > when matched THEN > UPDATE set flag='updated' > when not matched THEN > INSERT VALUES('new', 1999, 'true'); > {code} > With this MERGE {{*ONLY ONE*}} row should match in target which should be > updated. But currently due to the plan hive generate it will end up matching > both rows. > This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER > corresponding to all branches. > The part of the plan generated by hive for this statement consist of: > {noformat} > Map 2 > Map Operator Tree: > TableScan > alias: tsource > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > Map Join Operator > condition map: > Right Outer Join 0 to 1 > keys: > 0 j (type: int) > 1 i (type: int) > nullSafes: [true] > outputColumnNames: _col0, _col1, _col5, _col6 > input vertices: > 0 Map 1 > Statistics: Num rows: 1 Data size: 206 Basic stats: > COMPLETE Column stats: NONE > HybridGraceHashJoin: true > Filter Operator > predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: > boolean) > Statistics: Num rows: 1 Data size: 206 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col5 (type: > struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), > _col1 (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 206 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: > struct<writeid:bigint,bucketid:int,rowid:bigint>) > sort order: + > Map-reduce partition columns: UDFToInteger(_col0) > (type: int) > Statistics: Num rows: 1 Data size: 206 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col1 (type: string), _col2 > (type: int) > {noformat} > Result after JOIN will be : > {code:sql} > select s,j,i from ttarget right outer join tsource on i<=>j ; > NULL NULL NULL > NULL NULL 2 > {code} > On this resultset predicate {{(_col6 IS NOT DISTINCT FROM _col1)}} will be > true for both resulting into both rows matching. -- This message was sent by Atlassian Jira (v8.20.10#820010)