Daniel Becker has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/22339 )

Change subject: IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
......................................................................

IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS

Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at
   time T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.
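
For reviewers who want a concrete picture of the
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId" format described
above, here is a minimal Python sketch of parsing and serialising such a
value. It is an illustration only, not the code in IcebergUtil.java: the
function names are hypothetical, and it assumes no whitespace in the
property, non-negative field ids, and that the serialiser merges
adjacent field ids sharing a snapshot id into one range.

```python
from typing import Dict, List, Optional


def parse_snapshot_ids(prop: str) -> Dict[int, int]:
    """Expand 'start[-endIncl]:snapshotId' entries into {field_id: snapshot_id}."""
    result: Dict[int, int] = {}
    if not prop:
        return result
    for entry in prop.split(","):
        field_part, snapshot_part = entry.split(":")
        snapshot_id = int(snapshot_part)
        if "-" in field_part:
            # Range form: both ends are inclusive.
            start_str, end_str = field_part.split("-")
            start, end = int(start_str), int(end_str)
        else:
            # Single field id.
            start = end = int(field_part)
        if start > end:
            raise ValueError(f"invalid field id range: {field_part}")
        for field_id in range(start, end + 1):
            result[field_id] = snapshot_id
    return result


def _format_entry(start: int, end: int, snapshot_id: int) -> str:
    fields = str(start) if start == end else f"{start}-{end}"
    return f"{fields}:{snapshot_id}"


def serialize_snapshot_ids(mapping: Dict[int, int]) -> str:
    """Collapse contiguous field ids with the same snapshot id into ranges."""
    entries: List[str] = []
    run_start: Optional[int] = None
    run_end = 0
    run_snapshot = 0
    for field_id in sorted(mapping):
        snapshot_id = mapping[field_id]
        extends_run = (
            run_start is not None
            and field_id == run_end + 1
            and snapshot_id == run_snapshot
        )
        if extends_run:
            run_end = field_id
            continue
        if run_start is not None:
            entries.append(_format_entry(run_start, run_end, run_snapshot))
        run_start = run_end = field_id
        run_snapshot = snapshot_id
    if run_start is not None:
        entries.append(_format_entry(run_start, run_end, run_snapshot))
    return ",".join(entries)


# Stats for fields 1-3 were computed on snapshot 100; a later run
# recomputes only field 2 on snapshot 200.
stats = parse_snapshot_ids("1-3:100")
stats.update({2: 200})
updated_property = serialize_snapshot_ids(stats)
```

In this sketch, recomputing a single column in the middle of a range
splits the range into separate entries, which is how per-column recency
can be tracked when successive COMPUTE STATS runs cover different column
subsets.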

Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - Added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/catalog-op-executor.h
M be/src/service/client-request-state.cc
M common/thrift/JniCatalog.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/util/IcebergUtil.java
M fe/src/test/java/org/apache/impala/util/IcebergUtilTest.java
M testdata/workloads/functional-query/queries/QueryTest/iceberg-compute-stats.test
10 files changed, 327 insertions(+), 8 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/22339/19
--
To view, visit http://gerrit.cloudera.org:8080/22339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Gerrit-Change-Number: 22339
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>