Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21605 )

Change subject: IMPALA-13247: Support Reading Puffin files for the current 
snapshot
......................................................................


Patch Set 8:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/pom.xml
File fe/pom.xml:

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/pom.xml@446
PS6, Line 446:       <version>${datasketches.version}</version>
> It could be externalized similarly as Iceberg.
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430
PS6, Line 430:         applyPuffinStats(catalogTimeline);
> If there are puffin files, then you could add catalogTimeline.markEvent("Lo
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@463
PS6, Line 463:
> nit: use +4 indent
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@471
PS6, Line 471:   IcebergColumn col = getColumnByIcebergF
> What happen if:
- ndv > getTTableStats().getNum_rows(): getTTableStats().getNum_rows() is an 
estimation itself, so in this case I think we can go with the Puffin value 
(also an estimation).
- ndv == -1 (or any other negative value): I added a check to only set the 
number of distinct values if 'ndv' is non-negative. In this case possibly 
existing HMS stats values will not be overridden.


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@23
PS6, Line 23: import java.util.ArrayList;
> unused import
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@33
PS6, Line 33: import org.apache.hadoop.fs.Path;
> unused import
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@58
PS6, Line 58: this.blob = blo
> Can this be negative (unknown)? If yes, please normalize to -1. If not, ple
It comes from input outside of Impala, so it could be negative (OTOH I don't 
think a Theta sketch can produce a negative value). I handled that case in the 
new patch in IcebergTable.applyPuffinStats(), the caller of this class. I think 
it's better to handle it there and this class simply loads and forwards the 
information it reads from Puffin files, it does not modify it.


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@70
PS6, Line 70: }
            :       return result;
            :     } catch (Exception e) {
            :       // We add a catch-all clause here because Puffin stats 
loading shouldn't be an
> You could put the try-catch inside the for-loop.
This CATCH should normally never be triggered, but I added it in case an 
unexpected exception is thrown somewhere. The query shouldn't fail because of 
Puffin stats loading, this is a last guard against unexpected exceptions. 
That's why it encompasses the whole for loop.

You're right that it is inconsistent then that 'result' is returned even in 
case of an exception. I changed that so now we return an empty map in that case.


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@86
PS6, Line 86:  file. If we run into an error reading the cont
> Why don't you import org.apache.iceberg.exceptions.NotFoundException?
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@94
PS6, Line 94: 'UncheckedIOExcepti
> Why don't you import java.io.IOException?
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@84
PS6, Line 84:
            :     // Keep track of the Iceberg column field ids for which we 
read statistics from this
            :     // Puffin file. If we run into an error reading the contents 
of the file, the file may
            :     // be corrupt so we want to remove values already read from 
it from the overall
            :     // result.
            :     List<Integer> fieldIdsFromFile = new ArrayList<>();
            :     try {
            :       PuffinReader puffinReader = createPuffinReader(statsFile);
            :       List<BlobMetadata> blobs = getBlobs(puffinReader, 
currentSnapshotId);
            :
            :       // The 'UncheckedIOException' can be thrown from the 
'next()' method of the
            :       // iterator. Statistics that are loaded successfully before 
an exception is thrown
            :       // are discarded because the file is probably corrupt.
            :       for (Pair<BlobMetadata, ByteBuffer> puffinData: 
puffinReader.readAll(blobs)) {
            :         BlobMetadata blobMetadata = puffinData.first();
            :         ByteBuffer blobData = puffinData.second();
            :
            :         loadStatsFromBlob(blobMetadata, blobData, iceTable, 
statsFile,
            :             result, fieldIdsFromFile);
            :       }
            :     } cat
> I think the code would be more readable if you'd combine the different try-
Done, left two CATCH branches because the 'fileMissing' parameter to the 
logWarning() function is different in the two cases.


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@150
PS6, Line 150:
> nit: you could put this to the next line, as we usually don't break line in
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@166
PS6, Line 166:   private static byte[] getBytes(ByteBuffer byteBuffer) {
> It could be a format string.
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@173
PS6, Line 173:
> At some places there's ', here I see `. I think it would be better to be co
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml
File java/puffin-data-generator/pom.xml:

http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@1
PS6, Line 1: <?xml version="1.0"?>
> nit: this project could be added as a module to the parent pom
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@30
PS6, Line 30:   <artifactId>impala-puffin-data-generator</artifactId>
> Ther artifactId clashes with datagenerator subproject
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/pom.xml@77
PS6, Line 77:       <version>${datasketches.version}</version>
> See comment on fe/pom.xml:446
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java
File 
java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java:

http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@18
PS6, Line 18:
> nit: package names are looking better without underscores
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@35
PS6, Line 35:
> unused imports on L41, L43-45, L47
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@131
PS6, Line 131:
> sequneceNumber is always 0, it can be omitted.
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@301
PS6, Line 301:
> There could be one test case where all puffin files are corrupt.
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/java/puffin-data-generator/src/main/java/org/apache/impala/puffin_data_generator/PuffinDataGenerator.java@396
PS6, Line 396:
> IOException is more general, FileNotFoundException and UnsupportedEncodingE
Done


http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/21605/6/tests/query_test/test_iceberg.py@1807
PS6, Line 1807:         string_col STRING,
> Using
Done



--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 22 Aug 2024 14:20:15 +0000
Gerrit-HasComments: Yes

Reply via email to