Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/21605 )
Change subject: IMPALA-13247: Support Reading Puffin files for the current snapshot ...................................................................... Patch Set 6: (11 comments) Good job, it was nice to see the thorough testing! http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430 PS6, Line 430: applyPuffinStats(); If there are puffin files, then you could add catalogTimeline.markEvent("Loaded additional Puffin stats"); http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@463 PS6, Line 463: nit: use +4 indent http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java: http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@70 PS6, Line 70: catch (Exception e) { : // We add a catch-all clause here because Puffin stats loading shouldn't be an : // error, it should not stop the query. : LOG.warn("Unexpected error happened while loading Puffin column statistics."); You could put the try-catch inside the for-loop. Currently we return 'result' on the first exception we encounter. I don't see why we shouldn't look into the other statsfiles. OTOH, is this try-catch even needed anymore? Seems like you catch all checked exceptions inside loadStatsFromFile(). http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@86 PS6, Line 86: org.apache.iceberg.exceptions.NotFoundException Why don't you import org.apache.iceberg.exceptions.NotFoundException? http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@94 PS6, Line 94: java.io.IOException Why don't you import java.io.IOException? http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@84 PS6, Line 84: try { : puffinReader = createPuffinReader(statsFile); : } catch (org.apache.iceberg.exceptions.NotFoundException e) { : logWarning(iceTable.getFullName(), statsFile.path(), true, e); : return; : } : : List<BlobMetadata> blobs; : try { : blobs = getBlobs(puffinReader, currentSnapshotId); : } catch (java.io.IOException e) { : logWarning(iceTable.getFullName(), statsFile.path(), false, e); : return; : } : : // Keep track of the Iceberg column field ids for which we read statistics from this : // Puffin file. If we run into an error reading the contents of the file, the file may : // be corrupt so we want to remove values already read from it from the overall : // result. : List<Integer> fieldIdsFromFile = new ArrayList<>(); : try { I think the code would be more readable if you'd combine the different try-catch blocks: try { // do all the stuff } catch (IOException | NotFoundException | UncheckedIOException e) { logWarning(iceTable.getFullName(), statsFile.path(), false, e); result.keySet().removeAll(fieldIdsFromFile); } http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@150 PS6, Line 150: Map<Integer, nit: you could put this to the next line, as we usually don't break line in the middle of a type/parameter http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@173 PS6, Line 173: ` At some places there's ', here I see `. I think it would be better to be consistent. I think ' is better than `. http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java File java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java: http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@80 PS6, Line 80: PuffinDataGenerator Nice work writing this class! http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@301 PS6, Line 301: One of the Puffin files is corrupt There could be one test case where all puffin files are corrupt. http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py@1807 PS6, Line 1807: timestamp_col TIMESTAMP) STORED BY ICEBERG""" Using alter table t1 set column stats x ('numDVs'='2000'); you could even check the fallback to HMS stats. -- To view, visit http://gerrit.cloudera.org:8080/21605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866 Gerrit-Change-Number: 21605 Gerrit-PatchSet: 6 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 16 Aug 2024 17:18:42 +0000 Gerrit-HasComments: Yes
