Daniel Becker created IMPALA-13609:
--------------------------------------

             Summary: Store Iceberg snapshot id for COMPUTE STATS
                 Key: IMPALA-13609
                 URL: https://issues.apache.org/jira/browse/IMPALA-13609
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Daniel Becker
            Assignee: Daniel Becker


Currently, when COMPUTE STATS is run from Impala, we set the 
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the 
other hand, store the snapshot id for which stats were calculated. Although it 
is possible to retrieve the timestamp of a snapshot, comparing these two values 
is error-prone, e.g. in the following situation:
 * COMPUTE STATS calculation is running on Snapshot N
 * Snapshot N+1 is committed at time T
 * COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time T + 
Delta
 * Some engine writes Puffin statistics for Snapshot N+1

After this, the HMS stats will appear to be more recent even though they were 
calculated on Snapshot N, while the Puffin stats are for Snapshot N+1.
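
The situation above can be reproduced with a small, self-contained sketch. The 
timeline values, the Snapshot class and the helper name are hypothetical and 
only illustrate the comparison; this is not Impala or Iceberg code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp: int  # commit time, arbitrary units


# Hypothetical timeline matching the bullet points above.
T = 1_000
DELTA = 5
snapshot_n = Snapshot(snapshot_id=1, timestamp=T - 100)
snapshot_n_plus_1 = Snapshot(snapshot_id=2, timestamp=T)

# COMPUTE STATS read Snapshot N but finished after N+1 was committed.
hms_stats_snapshot = snapshot_n
last_compute_stats_time = T + DELTA  # value of 'impala.lastComputeStatsTime'

# Another engine wrote Puffin stats for Snapshot N+1.
puffin_stats_snapshot = snapshot_n_plus_1


def hms_looks_newer_by_time(compute_stats_time: int, puffin: Snapshot) -> bool:
    """Timestamp-based check: compare the COMPUTE STATS finish time with the
    commit time of the snapshot the Puffin stats belong to."""
    return compute_stats_time > puffin.timestamp


hms_looks_newer = hms_looks_newer_by_time(
    last_compute_stats_time, puffin_stats_snapshot)

# What we actually care about: which snapshot the stats describe.
hms_actually_newer = (
    hms_stats_snapshot.timestamp > puffin_stats_snapshot.timestamp)
```

Here hms_looks_newer is True while hms_actually_newer is False, which is the 
mismatch described above.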

To resolve this, COMPUTE STATS could set a new table property, e.g. 
'impala.computeStatsSnapshotId'.
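
A minimal sketch of how such a property might be used when deciding which stats 
are more recent. The property does not exist yet, and the function name, the 
return values and the snapshot_commit_time mapping are assumptions made for 
illustration. Since Iceberg snapshot ids are not guaranteed to be ordered, the 
sketch compares the commit times of the two snapshots rather than the ids 
themselves; snapshot ancestry or sequence numbers may be a more robust choice.

```python
from typing import Mapping

# Property name proposed in this issue; not an existing Impala property.
SNAPSHOT_ID_PROP = "impala.computeStatsSnapshotId"


def pick_newer_stats(
    table_props: Mapping[str, str],
    puffin_snapshot_id: int,
    snapshot_commit_time: Mapping[int, int],
) -> str:
    """Return 'hms', 'puffin', 'same' or 'unknown'.

    snapshot_commit_time maps snapshot id -> commit time and stands in for a
    lookup in the table's snapshot list (hypothetical helper input).
    """
    raw = table_props.get(SNAPSHOT_ID_PROP)
    if raw is None:
        # Stats were computed before the property existed.
        return "unknown"

    hms_snapshot_id = int(raw)
    if hms_snapshot_id == puffin_snapshot_id:
        return "same"

    hms_time = snapshot_commit_time.get(hms_snapshot_id)
    puffin_time = snapshot_commit_time.get(puffin_snapshot_id)
    if hms_time is None or puffin_time is None:
        # One of the snapshots is no longer known, e.g. it was expired.
        return "unknown"

    return "hms" if hms_time > puffin_time else "puffin"
```

With HMS stats recorded for Snapshot N and Puffin stats for Snapshot N+1, this 
returns 'puffin' regardless of when COMPUTE STATS finished.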

However, COMPUTE STATS can be run for only a subset of the columns, and then 
for a different subset in a subsequent run. The recency of the stats will then 
differ from column to column, so a single table-level snapshot id would not be 
accurate for every column. We could consider storing the snapshot id on a 
per-column basis.
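
As a rough illustration of the per-column idea, the sketch below keeps a 
mapping from column name to the snapshot id its stats were computed on and 
updates only the columns covered by each run. The function name and the 
in-memory dict are hypothetical; how this would be persisted (e.g. in table 
properties or elsewhere) is left open.

```python
from typing import Dict, Iterable


def record_compute_stats(
    per_column_snapshot: Dict[str, int],
    columns: Iterable[str],
    snapshot_id: int,
) -> Dict[str, int]:
    """Return a new mapping in which the given columns point at snapshot_id.

    Columns not covered by this run keep the snapshot id of the run that
    last computed their stats.
    """
    updated = dict(per_column_snapshot)
    for column in columns:
        updated[column] = snapshot_id
    return updated


# First run covers columns a and b on snapshot 10,
# a later run covers columns b and c on snapshot 20.
state: Dict[str, int] = {}
state = record_compute_stats(state, ["a", "b"], 10)
state = record_compute_stats(state, ["b", "c"], 20)
```

After the two runs, column a still refers to snapshot 10 while b and c refer to 
snapshot 20.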



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
