Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/20753 )
Change subject: IMPALA-12597: Basic Equality delete read support for Iceberg tables
......................................................................


Patch Set 1:

(17 comments)

Left a few comments, but the change looks great!

http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h@87
PS1, Line 87: treat_nulls_equal_
IS NOT DISTINCT FROM is a well-known SQL term. I think it would be better to keep that, but also add the additional comments about NULL-handling.


http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift@625
PS1, Line 625: equality_ids
This might have a better place in TIcebergTable. Though I see this is probably temporary, and later we might have the equality_ids in THdfsFileDesc.


http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift@404
PS1, Line 404: treat_nulls_equal
Again, I think we should keep the SQL terminology here, but keep the additional comment about NULLs.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
File fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java@58
PS1, Line 58: except it returns True if the rhs is NULL
Can we update this comment to: "except it returns True if both sides are NULL"?
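
To make the NULL-handling point on BinaryPredicate.java concrete, here is a minimal sketch of the difference between '=' and IS NOT DISTINCT FROM. It is not Impala code: the class NullSafeEquality and its two methods are hypothetical names, and a SQL NULL result is modeled as a Java null.

```java
import java.util.Objects;

// Hypothetical illustration class; not part of Impala.
final class NullSafeEquality {
  /** SQL '=': the result is NULL (modeled as Java null) if either side is NULL. */
  static Boolean eq(Integer lhs, Integer rhs) {
    if (lhs == null || rhs == null) return null;
    return lhs.equals(rhs);
  }

  /** SQL 'IS NOT DISTINCT FROM': never NULL; true when both sides are NULL. */
  static boolean notDistinct(Integer lhs, Integer rhs) {
    return Objects.equals(lhs, rhs);
  }

  public static void main(String[] args) {
    System.out.println(eq(null, null));          // prints null
    System.out.println(notDistinct(null, null)); // prints true
    System.out.println(notDistinct(1, null));    // prints false
  }
}
```

Note that only the case where both sides are NULL differs from plain equality in a way a join would observe: with '=', a NULL result means the rows do not match.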


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106
PS1, Line 106: TODO
Could you please add IMPALA-12598?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@108
PS1, Line 108: column
nit: columns


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@160
PS1, Line 160: || isDeleteRowsJoin_
This wouldn't be needed if we passed Operator.NOT_DISTINCT in the equality predicates.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@197
PS1, Line 197: positionDeleteFiles_.isEmpty() && equalityDeleteFiles_.isEmpty()
nit: for readability, it might be worth extracting this condition into a 'noDeleteFiles()' method.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@258
PS1, Line 258: addVirtualDataSeqNumSlot(tblRef_);
nit: this is always needed for equality deletes, so this method call could be moved to addEqualityColumnSlots().
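
For the 'noDeleteFiles()' nit at Line 197, the extraction could look roughly like the sketch below. Only the field names positionDeleteFiles_ and equalityDeleteFiles_ and the method name noDeleteFiles() come from the comments above; the class DeleteFileTracker, the String element type, and the two add methods are hypothetical stand-ins so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the planner class; not the actual IcebergScanPlanner.
final class DeleteFileTracker {
  // The real fields hold file descriptors; plain path strings are assumed here.
  private final List<String> positionDeleteFiles_ = new ArrayList<>();
  private final List<String> equalityDeleteFiles_ = new ArrayList<>();

  void addPositionDeleteFile(String path) { positionDeleteFiles_.add(path); }

  void addEqualityDeleteFile(String path) { equalityDeleteFiles_.add(path); }

  /** True if the scan has neither position nor equality delete files. */
  boolean noDeleteFiles() {
    return positionDeleteFiles_.isEmpty() && equalityDeleteFiles_.isEmpty();
  }
}
```

Call sites would then read 'if (noDeleteFiles()) { ... }' instead of repeating the two isEmpty() checks.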


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@323
PS1, Line 323: tblRef.getDesc().getSlots().stream()
:     .filter(s -> s.getVirtualColumnType() ==
:         TVirtualColumnType.ICEBERG_DATA_SEQUENCE_NUMBER)
:     .findFirst()
Maybe SingleNodePlanner.addSlotRefToDesc() could return the slot desc.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@405
PS1, Line 405: dataSlotDesc.getColumn() instanceof IcebergColumn
This must always be true for non-virtual columns, right?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@424
PS1, Line 424: data file
table?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@429
PS1, Line 429: Operator.EQ
Could we have Operator.NOT_DISTINCT here?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@464
PS1, Line 464: if (getIceTable().getIcebergApiTable().schemas().size() > 1) {
:   throw new ImpalaRuntimeException("Equality delete files are not supported for " +
:       "tables with schema evolution");
: }
Why do we have this restriction? We throw an error if there are files with different delete columns, or if a delete column is not present in the table. Other than these cases, what problems can happen with schema evolution?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@486
PS1, Line 486: IcebergEqualityDeleteTable deleteTable =
:     new IcebergEqualityDeleteTable(getIceTable(),
:         getIceTable().getName() + "-EQUALITY-DELETE-" + deleteScanNodeId.toString(),
:         equalityDeleteFiles_, equalityIds_, equalityDeletesRecordCount_);
: analyzer_.addVirtualTable(deleteTable);
:
: TableRef deleteTblRef = TableRef.newTableRef(analyzer_,
:     Arrays.asList(deleteTable.getDb().getName(), deleteTable.getName()),
:     tblRef_.getUniqueAlias() + "-equality-delete-" + deleteScanNodeId.toString());
: addVirtualDataSeqNumSlot(deleteTblRef);
nit: This is similar to what we have for position delete tables at L249. Can we create a helper method for this?


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@501
PS1, Line 501: Collections.emptyList(),
Maybe we could have a TODO+Jira about adding conjuncts that could be applied to the delete columns.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@517
PS1, Line 517: joinNode.setIsDeleteRowsJoin();
If we passed Operator.NOT_DISTINCT, we wouldn't need to set this, so the hash eq conjuncts would appear in the plans. I.e., it would be easier to verify the correctness of the plans.
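
As a rough illustration of why the join keys need NOT_DISTINCT semantics, the sketch below models an equality-delete anti join over in-memory rows. It is an assumption-laden toy, not how Impala's hash join is implemented: the class EqualityDeleteAntiJoin and its methods are hypothetical, rows are plain lists of the equality columns, and the data sequence number comparison is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; not Impala's hash join implementation.
final class EqualityDeleteAntiJoin {
  /** Builds a join key from the equality column values; NULLs are allowed. */
  static List<Object> key(Object... values) {
    return Arrays.asList(values);
  }

  /**
   * Returns the data rows that have no matching delete row. List.equals() and
   * List.hashCode() treat two nulls as equal, which mirrors NOT_DISTINCT keys.
   */
  static List<List<Object>> applyDeletes(
      List<List<Object>> dataRows, List<List<Object>> deleteRows) {
    Set<List<Object>> buildSide = new HashSet<>(deleteRows);
    List<List<Object>> result = new ArrayList<>();
    for (List<Object> row : dataRows) {
      if (!buildSide.contains(row)) result.add(row);
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<Object>> data = Arrays.asList(key(1, "a"), key(2, null), key(3, null));
    List<List<Object>> deletes = Arrays.asList(key(2, null));
    // The row (2, NULL) is removed because its NULL matches the delete row's NULL.
    System.out.println(applyDeletes(data, deletes).size()); // prints 2
  }
}
```

With plain '=' keys the row (2, NULL) would never match a delete row and would wrongly survive, which is the case the NOT_DISTINCT conjuncts would make visible in the plan.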


-- 
To view, visit http://gerrit.cloudera.org:8080/20753
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2053e6f321c69f1c82059a84a5d99aeaa9814cad
Gerrit-Change-Number: 20753
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 07 Dec 2023 11:12:00 +0000
Gerrit-HasComments: Yes
