Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20753 )
Change subject: IMPALA-12597: Basic Equality delete read support for Iceberg tables ...................................................................... Patch Set 10: (10 comments) http://gerrit.cloudera.org:8080/#/c/20753/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20753/8//COMMIT_MSG@14 PS8, Line 14: data sequence number of a delete file has to : be greater than the data sequence number of the data file being : investigated > nit: improvement idea for the future: This optimization sounds reasonable for me. created IMPALA-12649 for this. (copy-pasted your explanation to the ticket:) ) http://gerrit.cloudera.org:8080/#/c/20753/8//COMMIT_MSG@39 PS8, Line 39: hav > have Done http://gerrit.cloudera.org:8080/#/c/20753/8//COMMIT_MSG@45 PS8, Line 45: table > nit: tables Done http://gerrit.cloudera.org:8080/#/c/20753/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/20753/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@504 PS8, Line 504: equalityDeleteFiles_.add(delFileDesc.first); : addEqualityIds(delFile.equalityFieldIds()); : } else { : Preconditions.checkState(delFile.content() == FileContent.POSITION_DELETES); : positionDeleteFiles_.add(delFileDesc.first); : } : } : } : } > nit: for readability, this could go to its own method. Done http://gerrit.cloudera.org:8080/#/c/20753/8/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/20753/8/testdata/data/README@776 PS8, Line 776: when running a DELETE FROM statement > Could we also hack INSERT INTO? In that case Impala would just write the Pa I think that is also a valid way to hack these eq-deletes. Next time I'll try it this way :) http://gerrit.cloudera.org:8080/#/c/20753/8/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/20753/8/tests/query_test/test_iceberg.py@1262 PS8, Line 1262: e > nit: we usually don't add space here Done http://gerrit.cloudera.org:8080/#/c/20753/8/tests/query_test/test_iceberg.py@1264 PS8, Line 1264: : @SkipIfDockerizedCluster.internal_hostname : @SkipIf.hardcoded_uris : def test_multiple_equality_ids(self, unique_database): : """This test loads an Iceberg table that has 2 equality delete files with different : equality ID lists. A query on such a table fails due to lack of support.""" : SRC_DIR = os.path.join(os.environ['IMPALA_HOME'], : "testdata/data/iceberg_test/hadoop_catalog/ice/" : "iceberg_v2_delete_different_equality_ids") : DST_DIR = "/test-warehouse/iceberg_test/hadoop_catalog/ice/" \ : "iceberg_v2_delete_different_equality_ids" > Can we use 'file_utils.create_iceberg_table_from_directory'? With equality Good Idea. I gave this a try, however I realized that for this particular table it's not possible to actually load it as there are different equality fields ID lists, so even an ALTER TABLE after the CREATE TABLE would give an error. So in practice for hadoop catalog I had to add a number of 'is hadoop catalog' kind of checks inot the function plus I had to also make an option to not run the last ALTER TABLE statement in the function. I found this too much complexity being added for this particular table. In general for adding tables with this function into hadoop catalog would be great but it wouldn;t work in this case. http://gerrit.cloudera.org:8080/#/c/20753/8/tests/query_test/test_iceberg.py@1278 PS8, Line 1278: > Is there a reason we create an external table? If I create it as a non-external table I get an error saying that the table already exists. Anyway, create_iceberg_table_from_directory() also creates an external table http://gerrit.cloudera.org:8080/#/c/20753/8/tests/query_test/test_iceberg.py@1290 PS8, Line 1290: : err = self.execute_query_e > What are these lists? These are the mismatching equality ID lists. Added a comment for more clarity. http://gerrit.cloudera.org:8080/#/c/20753/10/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/20753/10/tests/query_test/test_iceberg.py@1267 PS10, Line 1267: def test_multiple_equality_ids(self, unique_database): Added the test table to unique_database so that I can save on level of try-catch block -- To view, visit http://gerrit.cloudera.org:8080/20753 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2053e6f321c69f1c82059a84a5d99aeaa9814cad Gerrit-Change-Number: 20753 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tamas Mate <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 18 Dec 2023 15:07:58 +0000 Gerrit-HasComments: Yes
