[ https://issues.apache.org/jira/browse/IMPALA-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Becker resolved IMPALA-13609.
------------------------------------
    Resolution: Implemented

> Store Iceberg snapshot id for COMPUTE STATS
> -------------------------------------------
>
>                 Key: IMPALA-13609
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13609
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Daniel Becker
>            Assignee: Daniel Becker
>            Priority: Major
>
> Currently, when COMPUTE STATS is run from Impala, we set the
> 'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the
> other hand, store the snapshot id for which stats were calculated. Although
> it is possible to retrieve the timestamp of a snapshot, comparing these two
> values is error-prone, e.g. in the following situation:
> * COMPUTE STATS calculation is running on Snapshot N
> * Snapshot N+1 is committed at time T
> * COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time T +
> Delta
> * Some engine writes Puffin statistics for Snapshot N+1
> After this, HMS stats will appear to be more recent even though they were
> calculated on Snapshot N, while we have Puffin stats for Snapshot N+1.
> To resolve this, COMPUTE STATS could set a new table property, e.g.
> 'impala.computeStatsSnapshotId'.
> On the other hand, COMPUTE STATS could be set to calculate stats for only a
> subset of the columns, and then a different subset in a subsequent run. The
> recency of the stats will then be different for each column. We could
> consider storing the snapshot id on a per column basis.
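
The race described in the issue can be sketched in a few lines of Python. This is only an illustrative model, not Impala's implementation: the `Snapshot` and `HmsStats` classes, the helper functions, the snapshot ids and the time values are all hypothetical. It assumes a linear snapshot history and determines recency by position in that history, since Iceberg snapshot ids are not guaranteed to increase over time and so cannot simply be compared numerically.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int  # commit time in arbitrary units


@dataclass(frozen=True)
class HmsStats:
    # Models 'impala.lastComputeStatsTime'.
    last_compute_stats_time: int
    # Models the proposed 'impala.computeStatsSnapshotId' property.
    compute_stats_snapshot_id: Optional[int] = None


def position(history: List[Snapshot], snapshot_id: int) -> int:
    """Index of a snapshot in a linear history, oldest first."""
    for i, snap in enumerate(history):
        if snap.snapshot_id == snapshot_id:
            return i
    raise KeyError(snapshot_id)


def hms_newer_by_time(history, hms, puffin_snapshot_id) -> bool:
    """Timestamp-based check: compares the COMPUTE STATS finish time
    with the commit time of the Puffin stats' snapshot."""
    puffin_time = history[position(history, puffin_snapshot_id)].committed_at
    return hms.last_compute_stats_time > puffin_time


def hms_newer_by_snapshot(history, hms, puffin_snapshot_id) -> bool:
    """Snapshot-based check: compares the snapshots the two sets of
    stats were actually computed on."""
    return (position(history, hms.compute_stats_snapshot_id)
            > position(history, puffin_snapshot_id))


# Scenario from the issue: COMPUTE STATS reads Snapshot N, Snapshot N+1 is
# committed at time T, COMPUTE STATS finishes at T + Delta, and another
# engine then writes Puffin stats for Snapshot N+1.
T, DELTA = 1000, 5
SNAPSHOT_N, SNAPSHOT_N_PLUS_1 = 7001, 3002  # ids need not be ordered
history = [Snapshot(SNAPSHOT_N, committed_at=900),
           Snapshot(SNAPSHOT_N_PLUS_1, committed_at=T)]
hms = HmsStats(last_compute_stats_time=T + DELTA,
               compute_stats_snapshot_id=SNAPSHOT_N)

by_time = hms_newer_by_time(history, hms, SNAPSHOT_N_PLUS_1)
by_snapshot = hms_newer_by_snapshot(history, hms, SNAPSHOT_N_PLUS_1)
print(by_time, by_snapshot)
```

In this model the timestamp check reports the HMS stats as newer, while the snapshot check reports that the Puffin stats are based on the later snapshot. Tracking recency per column would need one such snapshot id per column rather than a single table-level value.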