[ 
https://issues.apache.org/jira/browse/IMPALA-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Becker resolved IMPALA-13609.
------------------------------------
    Resolution: Implemented

> Store Iceberg snapshot id for COMPUTE STATS
> -------------------------------------------
>
>                 Key: IMPALA-13609
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13609
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Daniel Becker
>            Assignee: Daniel Becker
>            Priority: Major
>
> Currently, when COMPUTE STATS is run from Impala, we set the 
> 'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the 
> other hand, store the snapshot id for which stats were calculated. Although 
> it is possible to retrieve the timestamp of a snapshot, comparing these two 
> values is error-prone, e.g. in the following situation:
>  * COMPUTE STATS calculation is running on Snapshot N
>  * Snapshot N+1 is committed at time T
>  * COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time T + 
> Delta
>  * Some engine writes Puffin statistics for Snapshot N+1
> After this, the HMS stats will appear to be more recent even though they were 
> calculated on Snapshot N, while we have Puffin stats for Snapshot N+1.
> To resolve this, COMPUTE STATS could set a new table property, e.g. 
> 'impala.computeStatsSnapshotId'.
> A further complication is that COMPUTE STATS can be run to calculate stats 
> for only a subset of the columns, and then for a different subset in a 
> subsequent run. The recency of the stats will then be different for each 
> column. We could consider storing the snapshot id on a per-column basis.
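
The race in the quoted description can be sketched in a few lines of Python. This is only an illustration under assumptions, not Impala's implementation: the `Snapshot` class, the `SNAPSHOTS` history, the tick values, and all function names are hypothetical, and the snapshot id argument stands in for the proposed 'impala.computeStatsSnapshotId' property. The sketch looks up commit times by snapshot id rather than comparing ids numerically, since it does not assume ids are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int  # commit time in arbitrary ticks (illustrative)


# Hypothetical history: Snapshot N (id 1), N+1 (id 2, committed at T = 100),
# and a later snapshot (id 3).
SNAPSHOTS = {
    1: Snapshot(1, committed_at=50),
    2: Snapshot(2, committed_at=100),
    3: Snapshot(3, committed_at=150),
}


def fresher_by_time(last_compute_stats_time, puffin_snapshot_id, snapshots):
    """Compare the COMPUTE STATS finish time with the Puffin snapshot's commit time."""
    puffin_time = snapshots[puffin_snapshot_id].committed_at
    return "hms" if last_compute_stats_time > puffin_time else "puffin"


def fresher_by_snapshot(hms_snapshot_id, puffin_snapshot_id, snapshots):
    """Compare the commit times of the snapshots each set of stats was computed on."""
    hms_time = snapshots[hms_snapshot_id].committed_at
    puffin_time = snapshots[puffin_snapshot_id].committed_at
    return "hms" if hms_time > puffin_time else "puffin"


def fresher_per_column(hms_column_snapshots, puffin_snapshot_id, snapshots):
    """Apply the snapshot-based comparison to each column separately."""
    return {
        column: fresher_by_snapshot(snapshot_id, puffin_snapshot_id, snapshots)
        for column, snapshot_id in hms_column_snapshots.items()
    }


# COMPUTE STATS ran on Snapshot N (id 1) but finished at T + Delta = 105,
# after Snapshot N+1 (id 2) was committed at T = 100.
print(fresher_by_time(105, 2, SNAPSHOTS))    # timestamp comparison picks HMS
print(fresher_by_snapshot(1, 2, SNAPSHOTS))  # snapshot comparison picks Puffin

# Column "a" was last computed on snapshot 1, column "b" on snapshot 3.
print(fresher_per_column({"a": 1, "b": 3}, 2, SNAPSHOTS))
```

With these made-up values, the timestamp comparison wrongly prefers the HMS stats, while the snapshot-based comparison prefers the Puffin stats for Snapshot N+1. The per-column variant shows why a single table-level snapshot id may not be enough when different columns were last computed on different snapshots.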



--
This message was sent by Atlassian Jira
(v8.20.10#820010)