[ 
https://issues.apache.org/jira/browse/HIVE-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa reassigned HIVE-21481:
-------------------------------------

    Assignee: Krisztian Kasa

> MERGE correctness issues with null safe equality
> ------------------------------------------------
>
>                 Key: HIVE-21481
>                 URL: https://issues.apache.org/jira/browse/HIVE-21481
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Vineet Garg
>            Assignee: Krisztian Kasa
>            Priority: Major
>
> The way Hive currently generates plan for MERGE statement can lead to wrong 
> results with null safe equality.
> To illustrate consider the following reproducer
> {code:sql}
> create table ttarget(s string, j int, flag string) stored as orc 
> tblproperties("transactional"="true");
> truncate table ttarget;
> insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');
> create table tsource (i int);
> insert into tsource values(null),(2);
> {code}
> Let's say you have the following MERGE statement
> {code:sql}
> explain merge into ttarget using tsource on i<=>j
>  when matched THEN
>       UPDATE set flag='updated'
>  when not matched THEN
>       INSERT VALUES('new', 1999, 'true');
> {code}
> With this MERGE {{*ONLY ONE*}} row should match in target which should be 
> updated. But currently due to the plan hive generate it will end up matching 
> both rows.
> This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER 
> corresponding to all branches.
> The part of the plan generated by hive for this statement consist of:
> {noformat}
> Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: tsource
>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: NONE
>                   Map Join Operator
>                     condition map:
>                          Right Outer Join 0 to 1
>                     keys:
>                       0 j (type: int)
>                       1 i (type: int)
>                     nullSafes: [true]
>                     outputColumnNames: _col0, _col1, _col5, _col6
>                     input vertices:
>                       0 Map 1
>                     Statistics: Num rows: 1 Data size: 206 Basic stats: 
> COMPLETE Column stats: NONE
>                     HybridGraceHashJoin: true
>                     Filter Operator
>                       predicate: (_col6 IS NOT DISTINCT FROM _col1) (type: 
> boolean)
>                       Statistics: Num rows: 1 Data size: 206 Basic stats: 
> COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col5 (type: 
> struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), 
> _col1 (type: int)
>                         outputColumnNames: _col0, _col1, _col2
>                         Statistics: Num rows: 1 Data size: 206 Basic stats: 
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: 
> struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                           sort order: +
>                           Map-reduce partition columns: UDFToInteger(_col0) 
> (type: int)
>                           Statistics: Num rows: 1 Data size: 206 Basic stats: 
> COMPLETE Column stats: NONE
>                           value expressions: _col1 (type: string), _col2 
> (type: int)
> {noformat}
> Result after JOIN will be :
> {code:sql}
> select s,j,i from ttarget right outer join tsource on i<=>j ;
> NULL  NULL    NULL
> NULL  NULL    2
> {code}
> On this resultset predicate {{(_col6 IS NOT DISTINCT FROM _col1)}} will be 
> true for both resulting into both rows matching.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to