Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/21605 )
Change subject: IMPALA-13247: Support Reading Puffin files for the current snapshot ...................................................................... Patch Set 24: (7 comments) http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc File be/src/common/global-flags.cc: http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc@432 PS24, Line 432: false As we discussed offline, I would vote for disabling this feature by default, as in its current state it is not intuitive to use. As Puffin NDVs are only used if they belong to the latest snapshot, another insert from any engine will make them invalid, which is different than how COMPUTE STATS results are used, which are also accepted when stale. http://gerrit.cloudera.org:8080/#/c/21605/24/be/src/common/global-flags.cc@433 PS24, Line 433: "Impala will not read Iceberg Puffin stats files."); I disagree with putting all flags that are used only in Java in global-flags.cc (though others use it like this)- IMO this file should include flags that are truly global and affect both c++ and java. The flag could be also defined in backend-gflag-util.cc or catalog.cc. http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@430 PS24, Line 430: I think that it would be great to have comment that explains the current usage of stats in Iceberg (for example a class comment in this file + other places could refer to it). My understanding is that stats come from 3 places: 1. numRows: either written by Iceberg or by COMPUTE STATS, (takes account of deleted rows only in the latter case?) 2. HMS column stats: used even if stale 3. NDV from Puffin: overrides HMS stats if from the latest snapshot As Puffin only tells NDV, it is possible that at a given point NDV is from Puffin but other column stats e.g. num nulls come from HMS and is based on a much older state of the table. http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java File fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java: http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java@60 PS24, Line 60: this.blob = blob; Is it actually useful to store the blob? Removing it from PuffinStatsRecord should allow releasing its memory earlier. http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/util/IcebergUtil.java File fe/src/main/java/org/apache/impala/util/IcebergUtil.java: http://gerrit.cloudera.org:8080/#/c/21605/24/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1315 PS24, Line 1315: public static boolean applyPuffinStats(FeIcebergTable iceTbl, Can you add some comments, for example: - returns true if stats for the current snapshot were found - only overrides NDV Also, I think that the name of the function could indicate that it only changes NDV, e.g. applyPuffinNdvStats http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java File java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java: http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java@76 PS24, Line 76: CREATE_TBL Is this used anywhere? If it is just here as extra information, then it could be a comment. http://gerrit.cloudera.org:8080/#/c/21605/24/java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java@93 PS24, Line 93: Arrays.asLis Maybe initialize this in a loop instead of one by one? -- To view, visit http://gerrit.cloudera.org:8080/21605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50c1228988960a686d08a9b2942e01e366678866 Gerrit-Change-Number: 21605 Gerrit-PatchSet: 24 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 09 Oct 2024 13:33:04 +0000 Gerrit-HasComments: Yes
