Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21605 )

Change subject: IMPALA-13247: Support Reading Puffin files for the current 
snapshot
......................................................................


Patch Set 6:

(11 comments)

Good job, it was nice to see the thorough testing!

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430
PS6, Line 430:         applyPuffinStats();
If there are puffin files, then you could add catalogTimeline.markEvent("Loaded 
additional Puffin stats");


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@463
PS6, Line 463:
nit: use +4 indent


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@70
PS6, Line 70: catch (Exception e) {
            :       // We add a catch-all clause here because Puffin stats 
loading shouldn't be an
            :       // error, it should not stop the query.
            :       LOG.warn("Unexpected error happened while loading Puffin 
column statistics.");
You could put the try-catch inside the for-loop.

Currently we return 'result' on the first exception we encounter. I don't see 
why we shouldn't look into the other statsfiles.

OTOH, is this try-catch even needed anymore? Seems like you catch all checked 
exceptions inside loadStatsFromFile().


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@86
PS6, Line 86: org.apache.iceberg.exceptions.NotFoundException
Why don't you import org.apache.iceberg.exceptions.NotFoundException?


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@94
PS6, Line 94: java.io.IOException
Why don't you import java.io.IOException?


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@84
PS6, Line 84:     try {
            :       puffinReader = createPuffinReader(statsFile);
            :     } catch (org.apache.iceberg.exceptions.NotFoundException e) {
            :       logWarning(iceTable.getFullName(), statsFile.path(), true, 
e);
            :       return;
            :     }
            :
            :     List<BlobMetadata> blobs;
            :     try {
            :       blobs = getBlobs(puffinReader, currentSnapshotId);
            :     } catch (java.io.IOException e) {
            :       logWarning(iceTable.getFullName(), statsFile.path(), false, 
e);
            :       return;
            :     }
            :
            :     // Keep track of the Iceberg column field ids for which we 
read statistics from this
            :     // Puffin file. If we run into an error reading the contents 
of the file, the file may
            :     // be corrupt so we want to remove values already read from 
it from the overall
            :     // result.
            :     List<Integer> fieldIdsFromFile = new ArrayList<>();
            :     try {
I think the code would be more readable if you'd combine the different 
try-catch blocks:

  try {
    // do all the stuff
  } catch (IOException | NotFoundException | UncheckedIOException e) {
    logWarning(iceTable.getFullName(), statsFile.path(), false, e);
    result.keySet().removeAll(fieldIdsFromFile);
  }


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@150
PS6, Line 150: Map<Integer,
nit: you could put this to the next line, as we usually don't break line in the 
middle of a type/parameter


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@173
PS6, Line 173: `
At some places there's ', here I see `. I think it would be better to be 
consistent. I think ' is better than `.


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java
File 
java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@80
PS6, Line 80: PuffinDataGenerator
Nice work writing this class!


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@301
PS6, Line 301: One of the Puffin files is corrupt
There could be one test case where all puffin files are corrupt.


http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py@1807
PS6, Line 1807:         timestamp_col TIMESTAMP) STORED BY ICEBERG"""
Using

 alter table t1 set column stats x ('numDVs'='2000');

you could even check the fallback to HMS stats.



--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 16 Aug 2024 17:18:42 +0000
Gerrit-HasComments: Yes

Reply via email to