Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21605 )

Change subject: IMPALA-13247: Support Reading Puffin files for the current 
snapshot
......................................................................


Patch Set 24:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc@432
PS24, Line 432: false
As we discussed offline, I would vote for disabling this feature by default, as 
in its current state it is not intuitive to use. As Puffin NDVs are only used 
if they belong to the latest snapshot, another insert from any engine will make 
them invalid, which is different than how COMPUTE STATS results are used, which 
are also accepted when stale.


http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc@433
PS24, Line 433:     "Impala will not read Iceberg Puffin stats files.");
I disagree with putting all flags that are used only in Java in global-flags.cc 
(though others use it like this)- IMO this file should include flags that are 
truly global and affect both c++ and java. The flag could be also defined in 
backend-gflag-util.cc or catalog.cc.


http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430
PS24, Line 430:
I think that it would be great to have comment that explains the current usage 
of stats in Iceberg (for example a class comment in this file + other places 
could refer to it).

My understanding is that stats come from 3 places:
1. numRows: either written by Iceberg or by COMPUTE STATS, (takes account of 
deleted rows only in the latter case?)
2. HMS column stats: used even if stale
3. NDV from Puffin: overrides HMS stats if from the latest snapshot

As Puffin only tells NDV, it is possible that at a given point NDV is from 
Puffin but other column stats e.g. num nulls come from HMS and is based on a 
much older state of the table.


http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java:

http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@60
PS24, Line 60:       this.blob = blob;
Is it actually useful to store the blob? Removing it from PuffinStatsRecord 
should allow releasing its memory earlier.


http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1315
PS24, Line 1315:   public static boolean applyPuffinStats(FeIcebergTable iceTbl,
Can you add some comments, for example:
- returns true if stats for the current snapshot were found
- only overrides NDV

Also, I think that the name of the function could indicate that it only changes 
NDV, e.g. applyPuffinNdvStats


http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java
File 
java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java:

http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java@76
PS24, Line 76: CREATE_TBL
Is this used anywhere? If it is just here as extra information, then it could 
be a comment.


http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java@93
PS24, Line 93: Arrays.asLis
Maybe initialize this in a loop instead of one by one?



--
To view, visit http://gerrit.cloudera.org:8080/21605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Gerrit-Change-Number: 21605
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 09 Oct 2024 13:33:04 +0000
Gerrit-HasComments: Yes

Reply via email to