Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21605 )
Change subject: IMPALA-13247: Support Reading Puffin files for the current snapshot ...................................................................... Patch Set 8: (23 comments) http://gerrit.cloudera.org:8080/#/c/21605/6/fe/pom.xml File fe/pom.xml: http://gerrit.cloudera.org:8080/#/c/21605/6/fe/pom.xml@446 PS6, Line 446: <version>${datasketches.version}</version> > It could be externalized similarly as Iceberg. Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430 PS6, Line 430: applyPuffinStats(catalogTimeline); > If there are puffin files, then you could add catalogTimeline.markEvent("Lo Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@463 PS6, Line 463: > nit: use +4 indent Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@471 PS6, Line 471: IcebergColumn col = getColumnByIcebergF > What happen if: - ndv > getTTableStats().getNum_rows(): getTTableStats().getNum_rows() is an estimation itself, so in this case I think we can go with the Puffin value (also an estimation). - ndv == -1 (or any other negative value): I added a check to only set the number of distinct values if 'ndv' is non-negative. In this case possibly existing HMS stats values will not be overridden. http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java: http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@23 PS6, Line 23: import java.util.ArrayList; > unused import Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@33 PS6, Line 33: import org.apache.hadoop.fs.Path; > unused import Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@58 PS6, Line 58: this.blob = blo > Can this be negative (unknown)? If yes, please normalize to -1. If not, ple It comes from input outside of Impala, so it could be negative (OTOH I don't think a Theta sketch can produce a negative value). I handled that case in the new patch in IcebergTable.applyPuffinStats(), the caller of this class. I think it's better to handle it there and this class simply loads and forwards the information it reads from Puffin files, it does not modify it. http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@70 PS6, Line 70: } : return result; : } catch (Exception e) { : // We add a catch-all clause here because Puffin stats loading shouldn't be an > You could put the try-catch inside the for-loop. This CATCH should normally never be triggered, but I added it in case an unexpected exception is thrown somewhere. The query shouldn't fail because of Puffin stats loading, this is a last guard against unexpected exceptions. That's why it encompasses the whole for loop. You're right that it is inconsistent then that 'result' is returned even in case of an exception. I changed that so now we return an empty map in that case. http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@86 PS6, Line 86: file. If we run into an error reading the cont > Why don't you import org.apache.iceberg.exceptions.NotFoundException? Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@94 PS6, Line 94: 'UncheckedIOExcepti > Why don't you import java.io.IOException? Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@84 PS6, Line 84: : // Keep track of the Iceberg column field ids for which we read statistics from this : // Puffin file. If we run into an error reading the contents of the file, the file may : // be corrupt so we want to remove values already read from it from the overall : // result. : List<Integer> fieldIdsFromFile = new ArrayList<>(); : try { : PuffinReader puffinReader = createPuffinReader(statsFile); : List<BlobMetadata> blobs = getBlobs(puffinReader, currentSnapshotId); : : // The 'UncheckedIOException' can be thrown from the 'next()' method of the : // iterator. Statistics that are loaded successfully before an exception is thrown : // are discarded because the file is probably corrupt. : for (Pair<BlobMetadata, ByteBuffer> puffinData: puffinReader.readAll(blobs)) { : BlobMetadata blobMetadata = puffinData.first(); : ByteBuffer blobData = puffinData.second(); : : loadStatsFromBlob(blobMetadata, blobData, iceTable, statsFile, : result, fieldIdsFromFile); : } : } cat > I think the code would be more readable if you'd combine the different try- Done, left two CATCH branches because the 'fileMissing' parameter to the logWarning() function is different in the two cases. http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@150 PS6, Line 150: > nit: you could put this to the next line, as we usually don't break line in Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@166 PS6, Line 166: private static byte[] getBytes(ByteBuffer byteBuffer) { > It could be a format string. Done http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@173 PS6, Line 173: > At some places there's ', here I see `. I think it would be better to be co Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml File java/puffin-data-generator/pom.xml: http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@1 PS6, Line 1: <?xml version="1.0"?> > nit: this project could be added as a module to the parent pom Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@30 PS6, Line 30: <artifactId>impala-puffin-data-generator</artifactId> > Ther artifactId clashes with datagenerator subproject Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@77 PS6, Line 77: <version>${datasketches.version}</version> > See comment on fe/pom.xml:446 Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java File java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java: http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@18 PS6, Line 18: > nit: package names are looking better without underscores Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@35 PS6, Line 35: > unused imports on L41, L43-45, L47 Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@131 PS6, Line 131: > sequneceNumber is always 0, it can be omitted. Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@301 PS6, Line 301: > There could be one test case where all puffin files are corrupt. Done http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@396 PS6, Line 396: > IOException is more general, FileNotFoundException and UnsupportedEncodingE Done http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py@1807 PS6, Line 1807: string_col STRING, > Using Done -- To view, visit http://gerrit.cloudera.org:8080/21605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866 Gerrit-Change-Number: 21605 Gerrit-PatchSet: 8 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 22 Aug 2024 14:20:15 +0000 Gerrit-HasComments: Yes
