[Impala-ASF-CR] IMPALA-12588: Don't UPDATE rows that already have the desired value

Noemi Pap-Takacs (Code Review) Wed, 26 Feb 2025 08:21:33 -0800

Noemi Pap-Takacs has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/22407 )


Change subject: IMPALA-12588: Don't UPDATE rows that already have the desired 
value
......................................................................

IMPALA-12588: Don't UPDATE rows that already have the desired value

When UPDATEing an Iceberg or Kudu table, we should change as few rows
as possible. In case of Iceberg tables it means writing as few new
data records and delete records as possible.
Therefore, if rows already have the new values we should just ignore
them. One way to achieve this is to add extra predicates, e.g.:

  UPDATE tbl SET k = 3 WHERE i > 4;
    ==>
  UPDATE tbl SET k = 3 WHERE i > 4 AND k IS DISTINCT FROM 3;

So we won't write new data/delete records for the rows that already have
the desired value.

Explanation on how to create extra predicates to filter out these rows:

If there are multiple assignments in the SET list, we can only skip
updating a row if all the mentioned values are already equal.
If either of the values needs to be updated, the entire row does.
Therefore we can think of the SET list as predicates connected with AND
and all of them need to be taken into consideration.
To negate this SET list, we have to negate the individual SET
assignments and connect them with OR.
Then add this new compound predicate to the original where predicates
with an AND (if there were none, just create a WHERE predicate from it).

                AND
              /     \
      original        OR
 WHERE predicate    /    \
                  !a       OR
                         /    \
                       !b     !c

This simple graph illustrates how the where predicate is rewritten.
(Considering an UPDATE statement that sets 3 columns.)
'!a', '!b' and '!c' are the negations of the individual assignments in
the SET list. So the extended WHERE predicate is:
(original WHERE predicate) AND (!a OR !b OR !c)
To handle NULL values correctly, we use IS DISTINCT FROM instead of
simply negating the assignment with operator '!='.

If the assignments contain UDFs, the result might be inconsistent
because of possible non-deterministic values or state in the UDFs,
therefore we should not rewrite the WHERE predicate at all.

Evaluating expressions can be expensive, therefore this optimization
can be limited or switched off entirely using the Query Option
SKIP_UNNEEDED_UPDATES_COL_LIMIT. By default, there is no filtering
if more than 10 assignments are in the SET list.

-------------------------------------------------------------------
Some performance measurements on a tpch lineitem table:

- predicates in HASH join, all updates can be skipped
Q1/(Q2 adds extra 4 items to the SET list):
update t set t.l_suppkey = s.l_suppkey,
  (t.l_partkey=s.l_partkey,
  t.l_quantity=s.l_quantity,
  t.l_returnflag=s.l_returnflag,
  t.l_shipmode=s.l_shipmode)
from ice_lineitem t join ice_lineitem s
on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates in HASH join, all rows need to be updated
Q3: update t set
 t.l_suppkey = s.l_suppkey,
 t.l_partkey=s.l_partkey,
 t.l_quantity=s.l_quantity,
 t.l_returnflag=s.l_returnflag,
 t.l_shipmode=concat(s.l_shipmode,' ')
 from ice_lineitem t join ice_lineitem s
 on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates pushed down to the scanner, all rows updated
Q4/(Q5 adds extra 8 items to the SET ist):
update ice_lineitem set
  (l_suppkey = l_suppkey + 0,
  l_partkey=l_partkey + 0,
  l_quantity=l_quantity,
  l_returnflag=l_returnflag,
  l_tax = l_tax,
  l_discount= l_discount,
  l_comment = l_comment,
  l_receiptdate = l_receiptdate,)
l_shipmode=concat(l_shipmode,' ');

+=======+============+==========+======+
| Query | unfiltered | filtered | diff |
+=======+============+==========+======+
| Q1    |        4.1 |      1.9 | -46% |
+-------+------------+----------+------+
| Q2    |        4.2 |      2.1 | -50% |
+-------+------------+----------+------+
| Q3    |        4.3 |      4.7 | +10% |
+-------+------------+----------+------+
| Q4    |        3.0 |      3.0 | +0%  |
+-------+------------+----------+------+
| Q5    |        3.1 |      3.1 | +0%  |
+-------+------------+----------+------+

The results show that in the best case (we can skip all rows)
this change can cause significant perf improvement ~50%, since
0 rows were written. See Q1 and Q2.
If the predicates are evaluated in the join operator, but there were
no matches (worst case scenario) we can lose about 10%. (Q3)
If all the predicates can be pushed down to the scanners, the change
does not seem to cause significant difference (~0% in Q4 and Q5)
even if all rows have to be updated.

Testing:
 - Analysis
 - Planner
 - E2E
 - Kudu
 - Iceberg
 - testing the new query option: SKIP_UNNEEDED_UPDATES_COL_LIMIT
Change-Id: I926c80e8110de5a4615a3624a81a330f54317c8b
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
M fe/src/main/java/org/apache/impala/analysis/ModifyStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
M fe/src/test/java/org/apache/impala/planner/ColumnsTest.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/kudu-dml-with-utc-conversion.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_update.test
M tests/query_test/test_iceberg.py
15 files changed, 405 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/22407/9
--
To view, visit http://gerrit.cloudera.org:8080/22407
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I926c80e8110de5a4615a3624a81a330f54317c8b
Gerrit-Change-Number: 22407
Gerrit-PatchSet: 9
Gerrit-Owner: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Steve Carlin <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>

[Impala-ASF-CR] IMPALA-12588: Don't UPDATE rows that already have the desired value

Reply via email to