Daniel Becker created IMPALA-13609:
--------------------------------------
Summary: Store Iceberg snapshot id for COMPUTE STATS
Key: IMPALA-13609
URL: https://issues.apache.org/jira/browse/IMPALA-13609
Project: IMPALA
Issue Type: Improvement
Reporter: Daniel Becker
Assignee: Daniel Becker
Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the
other hand, store the snapshot id for which stats were calculated. Although it
is possible to retrieve the timestamp of a snapshot, comparing these two values
is error-prone, e.g. in the following situation:
* COMPUTE STATS calculation is running on Snapshot N
* Snapshot N+1 is committed at time T
* COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' to T + Delta
* Some engine writes Puffin statistics for Snapshot N+1
After this, the HMS stats will appear to be more recent even though they were
calculated on Snapshot N, while the Puffin stats cover the newer Snapshot N+1.
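The scenario above can be replayed in a small, self-contained sketch. It is
only an illustration: the Snapshot class, the numeric values for T and Delta,
the snapshot ids and both helper functions are hypothetical and are not Impala
or Iceberg APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int  # commit time in arbitrary units (hypothetical)


# Hypothetical timeline mirroring the bullet points above.
T = 1000
DELTA = 5
snapshot_n = Snapshot(snapshot_id=1, committed_at=900)
snapshot_n_plus_1 = Snapshot(snapshot_id=2, committed_at=T)

# COMPUTE STATS read Snapshot N but finished after Snapshot N+1 was committed.
hms_stats_snapshot = snapshot_n
last_compute_stats_time = T + DELTA

# Another engine wrote Puffin stats for Snapshot N+1.
puffin_stats_snapshot = snapshot_n_plus_1

# Snapshot ids ordered oldest to newest. The order comes from the table's
# history; this sketch does not assume the numeric ids themselves are ordered.
history = [snapshot_n.snapshot_id, snapshot_n_plus_1.snapshot_id]


def hms_newer_by_timestamp(compute_stats_time, puffin_snapshot):
    """Compare the COMPUTE STATS finish time with the snapshot commit time."""
    return compute_stats_time > puffin_snapshot.committed_at


def hms_newer_by_snapshot(hms_snapshot, puffin_snapshot, history):
    """Compare the positions of the two snapshots in the table history."""
    return (history.index(hms_snapshot.snapshot_id)
            > history.index(puffin_snapshot.snapshot_id))


by_time = hms_newer_by_timestamp(last_compute_stats_time, puffin_stats_snapshot)
by_snapshot = hms_newer_by_snapshot(hms_stats_snapshot, puffin_stats_snapshot,
                                    history)

print(by_time)      # True: the timestamp check wrongly prefers the HMS stats
print(by_snapshot)  # False: the snapshot check correctly prefers Puffin stats
```

The timestamp check fails because it records when COMPUTE STATS finished, not
which data it read.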
To resolve this, COMPUTE STATS could set a new table property, e.g.
'impala.computeStatsSnapshotId'.
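A minimal sketch of what recording the new property could look like is shown
below. It assumes the snapshot id is captured when the run starts, so that it
identifies the data the stats were computed on. The helper function, the
example snapshot id and the string encoding of the values are assumptions made
for illustration, not the actual Impala implementation.

```python
import time

# Property name proposed in this issue.
SNAPSHOT_ID_PROP = "impala.computeStatsSnapshotId"
# Existing property mentioned in this issue.
LAST_COMPUTE_TIME_PROP = "impala.lastComputeStatsTime"


def props_after_compute_stats(props, snapshot_id_at_start, finished_at=None):
    """Return a copy of the table properties updated for a finished run.

    snapshot_id_at_start is the snapshot that was current when the run began,
    not the one that is current when the run finishes.
    """
    if finished_at is None:
        finished_at = int(time.time())
    updated = dict(props)
    # Storing both values as strings is an assumption of this sketch.
    updated[LAST_COMPUTE_TIME_PROP] = str(finished_at)
    updated[SNAPSHOT_ID_PROP] = str(snapshot_id_at_start)
    return updated


# The snapshot id below is made up for the example.
props = props_after_compute_stats({}, snapshot_id_at_start=8744736658442914487,
                                  finished_at=1005)
print(props)
```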
However, COMPUTE STATS can be run for only a subset of the columns, and then
for a different subset in a subsequent run. The recency of the stats will then
differ from column to column, so a single table-level snapshot id would not
describe every column. We could consider storing the snapshot id on a
per-column basis.
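The per-column idea can be sketched as a mapping from column name to the
snapshot id its stats were computed on. The record_run helper, the column
names and the snapshot ids are hypothetical; how such a mapping would be
stored in table properties is left open here.

```python
def record_run(column_snapshots, columns, snapshot_id):
    """Return a copy of the mapping with `columns` marked as computed on
    `snapshot_id`. Columns not in this run keep their previous snapshot id."""
    updated = dict(column_snapshots)
    for col in columns:
        updated[col] = snapshot_id
    return updated


SNAPSHOT_N = 1          # hypothetical ids
SNAPSHOT_N_PLUS_1 = 2

# First run: stats for columns a and b on Snapshot N.
state = record_run({}, ["a", "b"], SNAPSHOT_N)
# Second run: stats for columns b and c on Snapshot N+1.
state = record_run(state, ["b", "c"], SNAPSHOT_N_PLUS_1)

print(state)  # {'a': 1, 'b': 2, 'c': 2}
```

After the two runs, column a still carries stats from Snapshot N while b and c
carry stats from Snapshot N+1, which a single table-level property could not
express.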
--
This message was sent by Atlassian Jira
(v8.20.10#820010)