[ https://issues.apache.org/jira/browse/HIVE-24775?focusedWorklogId=554763&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-554763 ]
ASF GitHub Bot logged work on HIVE-24775: ----------------------------------------- Author: ASF GitHub Bot Created on: 19/Feb/21 13:57 Start Date: 19/Feb/21 13:57 Worklog Time Spent: 10m Work Description: kasakrisz opened a new pull request #1995: URL: https://github.com/apache/hive/pull/1995 ### What changes were proposed in this pull request? When transforming plan of materialized view rebuild to incremental rebuild in case of the view definition has aggregate: * Use null safe equality operators in join condition instead of equality operators * Introduce a Project on top of MV scan having all columns from the view plus a boolean literal which indicates whether the row with the key values coming from the joinRightInput exists in the view. Use this flag in `CalcitePlanner.fixUpASTAggregateIncrementalRebuild` to decide which branch a row should go: update or insert. * Add null checks of aggregated values when calculated new aggregated values. ### Why are the changes needed? Rows with null aggregate keys and aggregated values was not handled by incremental MV rebuild and it could lead to data corruption. ### Does this PR introduce _any_ user-facing change? Yes. Query result changes. See jira for example. ### How was this patch tested? ``` mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestMiniLlapLocalCliDriver -Dqfile=materialized_view_create_rewrite_nulls.q,materialized_view_create_rewrite_4.q,materialized_view_create_rewrite_one_key_gby.q -pl itests/qtest -Pitests ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 554763) Time Spent: 40m (was: 0.5h) > Incorrect null handling when rebuilding Materialized view incrementally > ----------------------------------------------------------------------- > > Key: HIVE-24775 > URL: https://issues.apache.org/jira/browse/HIVE-24775 > Project: Hive > Issue Type: Bug > Reporter: Krisztian Kasa > Assignee: Krisztian Kasa > Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > {code} > CREATE TABLE t1 (a int, b varchar(256), c decimal(10,2), d int) STORED AS orc > TBLPROPERTIES ('transactional'='true'); > INSERT INTO t1 VALUES > (NULL, 'null_value', 100.77, 7), > (1, 'calvin', 978.76, 3), > (1, 'charlie', 9.8, 1); > CREATE MATERIALIZED VIEW mat1 TBLPROPERTIES ('transactional'='true') AS > SELECT a, b, sum(d) > FROM t1 > WHERE c > 10.0 > GROUP BY a, b; > INSERT INTO t1 VALUES > (NULL, 'null_value', 100.88, 8), > (1, 'charlie', 15.8, 1); > ALTER MATERIALIZED VIEW mat1 REBUILD; > SELECT * FROM mat1 > ORDER BY a, b; > {code} > View contains: > {code} > 1 calvin 3 > 1 charlie 1 > NULL null_value 8 > NULL null_value 7 > {code} > but it should contain: > {code} > 1 calvin 3 > 1 charlie 1 > NULL null_value 15 > {code} > Rows with aggregate key columns having NULL values are not aggregated because > incremental materialized view rebuild plan is altered by > [applyPreJoinOrderingTransforms|https://github.com/apache/hive/blob/76732ad27e139fbdef25b820a07cf35934771083/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L1975]: > IS NOT NULL filter added for each of these columns on top of the view scan > when joining with the branch pulls the rows inserted after the last rebuild: > {code} > HiveProject($f0=[$3], $f1=[$4], $f2=[CASE(AND(IS NULL($0), IS NULL($1)), $5, > +($5, $2))]) > HiveFilter(condition=[OR(AND(IS NULL($0), IS NULL($1)), AND(=($0, $3), > =($1, $4)))]) > HiveJoin(condition=[AND(=($0, $3), =($1, $4))], joinType=[right], > algorithm=[none], cost=[not available]) > HiveProject(a=[$0], b=[$1], _c2=[$2]) > HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL($1))]) > HiveTableScan(table=[[default, mat1]], table:alias=[default.mat1]) > HiveProject(a=[$0], b=[$1], $f2=[$2]) > HiveAggregate(group=[{0, 1}], agg#0=[sum($3)]) > HiveFilter(condition=[AND(<(1, $6.writeid), >($2, 10))]) > HiveTableScan(table=[[default, t1]], table:alias=[t1]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)