Daniel Becker has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/22339 )
Change subject: IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
......................................................................

IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS

Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at
   time T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form "fieldId:snapshotId".

Storing the snapshot ids on a per-column basis is needed because
COMPUTE STATS can be set to calculate stats for only a subset of the
columns, and then a different subset in a subsequent run. The recency
of the stats will then be different for each column.

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

Tables may have many columns, so to prevent the
'impala.computeStatsSnapshotIds' table property from becoming too long,
it will only include information for 10 columns by default. This can be
modified for a table by setting the
'impala.computeStatsSnapshotIdsMaxSize' table property to the
appropriate value.
If there are stats for more columns than this limit, information about
older stats will be discarded.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.

Testing:
 - Added tests in iceberg-compute-stats.test.

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
---
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M testdata/workloads/functional-query/queries/QueryTest/iceberg-compute-stats.test
3 files changed, 148 insertions(+), 1 deletion(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/22339/5
--
To view, visit http://gerrit.cloudera.org:8080/22339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Gerrit-Change-Number: 22339
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
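
For readers following the commit message, the "fieldId:snapshotId" format of 'impala.computeStatsSnapshotIds' can be illustrated with a rough sketch. This is not the Java code from the patch: the function names below are hypothetical, and the sketch assumes both ids are plain decimal integers. It leaves out the size limit, because the message does not say how "older" entries are identified.

```python
from typing import Dict, Iterable


def parse_snapshot_ids(prop: str) -> Dict[int, int]:
    """Parse 'fieldId:snapshotId,...' into {field_id: snapshot_id}."""
    result: Dict[int, int] = {}
    for entry in prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty property or stray commas
        field_id, snapshot_id = entry.split(":")
        result[int(field_id)] = int(snapshot_id)
    return result


def format_snapshot_ids(ids: Dict[int, int]) -> str:
    """Serialize {field_id: snapshot_id} back to the property format."""
    return ",".join(f"{field_id}:{snap}" for field_id, snap in ids.items())


def record_compute_stats(prop: str, field_ids: Iterable[int],
                         snapshot_id: int) -> str:
    """Hypothetical helper: update only the columns covered by this run.

    Columns not in field_ids keep the snapshot id of their earlier run,
    which is why the property is tracked per column.
    """
    ids = parse_snapshot_ids(prop)
    for field_id in field_ids:
        ids[field_id] = snapshot_id
    return format_snapshot_ids(ids)
```

For example, starting from "1:100,2:100" and recomputing stats for field ids 2 and 3 on snapshot 200 leaves field 1 at snapshot 100 while fields 2 and 3 move to 200.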