[jira] [Commented] (IMPALA-13609) Store Iceberg snapshot id for COMPUTE STATS

ASF subversion and git services (Jira) Sat, 05 Apr 2025 10:18:13 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940085#comment-17940085
 ]


ASF subversion and git services commented on IMPALA-13609:
----------------------------------------------------------

Commit d714798904de449cd629df53934d8336bd767512 in impala's branch 
refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d71479890 ]

IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS

Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time
   T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.
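
The scenario above can be illustrated with a small sketch. It is not
Impala code: the Snapshot class, the helper names and all numeric values
are made up for illustration, and the commit itself does not yet change
how Impala chooses between the two sources of stats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int   # Iceberg snapshot ids are not ordered by age
    committed_at: int  # commit time in arbitrary units (hypothetical values)


# Hypothetical timeline matching the scenario in the text.
snapshot_n = Snapshot(snapshot_id=7001, committed_at=900)
snapshot_n1 = Snapshot(snapshot_id=3002, committed_at=1000)  # committed at T
last_compute_stats_time = 1005                               # T + Delta

hms_stats_snapshot = snapshot_n      # COMPUTE STATS actually read snapshot N
puffin_stats_snapshot = snapshot_n1  # Puffin stats were written for N+1


def hms_newer_by_time(compute_stats_time: int, puffin: Snapshot) -> bool:
    """Compare 'impala.lastComputeStatsTime' with the Puffin snapshot's time."""
    return compute_stats_time > puffin.committed_at


def hms_newer_by_snapshot(hms: Snapshot, puffin: Snapshot) -> bool:
    """Compare the snapshots that the two sets of stats were computed on."""
    return hms.committed_at > puffin.committed_at


# The timestamp-based check claims that HMS stats are newer ...
print(hms_newer_by_time(last_compute_stats_time, puffin_stats_snapshot))
# ... while the snapshot-based check shows that Puffin stats are newer.
print(hms_newer_by_snapshot(hms_stats_snapshot, puffin_stats_snapshot))
```

The sketch compares snapshot commit times rather than the ids themselves,
because a larger snapshot id does not imply a more recent snapshot.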

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.
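
As a rough illustration of the format, a parser could look like the
sketch below. The function name is hypothetical and this is not the
parsing code added to Impala (which is written in Java); it only follows
the format described above and assumes non-negative field ids.

```python
def parse_snapshot_ids(value: str) -> dict[int, int]:
    """Parse 'start[-endIncl]:snapshotId,...' into {field_id: snapshot_id}."""
    result: dict[int, int] = {}
    if not value.strip():
        return result
    for entry in value.split(","):
        field_part, snapshot_part = entry.strip().split(":")
        snapshot_id = int(snapshot_part)
        # 'start' alone is a single field id, 'start-end' an inclusive range.
        start_text, _, end_text = field_part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if end < start:
            raise ValueError(f"invalid field id range: {field_part!r}")
        for field_id in range(start, end + 1):
            result[field_id] = snapshot_id
    return result


# Fields 1, 2 and 3 have stats for snapshot 100, field 5 for snapshot 200.
print(parse_snapshot_ids("1-3:100,5:200"))
```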

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.
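
The per-column bookkeeping can be sketched as follows. The function names
are hypothetical, and collapsing neighbouring field ids that share a
snapshot id into a single range is an assumption about how the property
might be written, not a description of Impala's serialiser.

```python
def record_compute_stats(current: dict[int, int], field_ids: list[int],
                         snapshot_id: int) -> dict[int, int]:
    """Return a copy of 'current' where the given fields point at snapshot_id."""
    updated = dict(current)
    for field_id in field_ids:
        updated[field_id] = snapshot_id
    return updated


def format_entry(start: int, end: int, snapshot_id: int) -> str:
    fields = str(start) if start == end else f"{start}-{end}"
    return f"{fields}:{snapshot_id}"


def serialize_snapshot_ids(mapping: dict[int, int]) -> str:
    """Write {field_id: snapshot_id} as 'start[-endIncl]:snapshotId,...'."""
    entries = []
    run = None  # (start, end, snapshot_id) of the range being built
    for field_id in sorted(mapping):
        snapshot_id = mapping[field_id]
        if run and field_id == run[1] + 1 and snapshot_id == run[2]:
            run = (run[0], field_id, snapshot_id)  # extend the current range
            continue
        if run:
            entries.append(format_entry(*run))
        run = (field_id, field_id, snapshot_id)
    if run:
        entries.append(format_entry(*run))
    return ",".join(entries)


# First run: stats for all four columns on snapshot 100.
state = record_compute_stats({}, [1, 2, 3, 4], 100)
# Second run: stats only for the columns with field ids 2 and 3, on snapshot 200.
state = record_compute_stats(state, [2, 3], 200)
print(serialize_snapshot_ids(state))
```

After the second run, columns 1 and 4 still refer to snapshot 100 while
columns 2 and 3 refer to snapshot 200, which a single table-level
property could not express.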

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.

Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - Added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Reviewed-on: http://gerrit.cloudera.org:8080/22339
Reviewed-by: Daniel Becker <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Store Iceberg snapshot id for COMPUTE STATS
> -------------------------------------------
>
>                 Key: IMPALA-13609
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13609
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Daniel Becker
>            Assignee: Daniel Becker
>            Priority: Major
>
> Currently, when COMPUTE STATS is run from Impala, we set the 
> 'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on the 
> other hand, store the snapshot id for which stats were calculated. Although 
> it is possible to retrieve the timestamp of a snapshot, comparing these two 
> values is error-prone, e.g. in the following situation:
>  * COMPUTE STATS calculation is running on Snapshot N
>  * Snapshot N+1 is committed at time T
>  * COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time T + 
> Delta
>  * Some engine writes Puffin statistics for Snapshot N+1
> After this, HMS stats will appear to be more recent even though they were 
> calculated on Snapshot N, while we have Puffin stats for Snapshot N+1.
> To resolve this, COMPUTE STATS could set a new table property, e.g. 
> 'impala.computeStatsSnapshotId'.
> On the other hand, COMPUTE STATS could be set to calculate stats for only a 
> subset of the columns, and then a different subset in a subsequent run. The 
> recency of the stats will then be different for each column. We could 
> consider storing the snapshot id on a per column basis.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
