Daniel Becker has uploaded a new patch set (#19).
( http://gerrit.cloudera.org:8080/22339 )

Change subject: IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
......................................................................

IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS

Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time
   T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent because their
timestamp (T + Delta) is later than the commit time of snapshot N+1 (T),
even though they were calculated on snapshot N, while we have Puffin
stats for snapshot N+1.
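
The scenario above can be sketched in a few lines of Python. This is only
an illustration, not Impala code: the Snapshot class, the numeric values
for T and Delta, and both helper functions are hypothetical. It assumes
that recency is judged by the commit time of the snapshot the stats were
computed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int  # commit time in arbitrary units (hypothetical)


T, DELTA = 100, 5
snap_n = Snapshot(snapshot_id=1, committed_at=T - 10)  # snapshot N
snap_n1 = Snapshot(snapshot_id=2, committed_at=T)      # snapshot N+1
snapshots = {s.snapshot_id: s for s in (snap_n, snap_n1)}

# COMPUTE STATS ran on snapshot N but finished at T + Delta.
hms_last_compute_stats_time = T + DELTA
hms_stats_snapshot_id = snap_n.snapshot_id
# Another engine wrote Puffin stats for snapshot N+1.
puffin_stats_snapshot_id = snap_n1.snapshot_id


def hms_newer_by_timestamp() -> bool:
    # Compares the COMPUTE STATS finish time with the Puffin snapshot's time.
    puffin_time = snapshots[puffin_stats_snapshot_id].committed_at
    return hms_last_compute_stats_time > puffin_time


def hms_newer_by_snapshot() -> bool:
    # Compares the commit times of the snapshots the stats were computed on.
    hms_time = snapshots[hms_stats_snapshot_id].committed_at
    puffin_time = snapshots[puffin_stats_snapshot_id].committed_at
    return hms_time > puffin_time
```

Under these assumptions the timestamp comparison reports the HMS stats as
newer, while the snapshot-based comparison correctly reports them as older.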

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.
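
As an illustration of the format, here is a minimal Python sketch of a
parser. It is not the implementation in IcebergUtil.java; the function
name parse_snapshot_ids and the sample value "1-3:1000,5:2000" are
hypothetical, and error handling is reduced to raising ValueError.

```python
def parse_snapshot_ids(prop: str) -> dict[int, int]:
    """Map each field id to the snapshot id its stats were computed on."""
    result: dict[int, int] = {}
    if not prop:
        return result
    for entry in prop.split(","):
        # Each entry is "fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId".
        field_part, snapshot_part = entry.split(":")
        snapshot_id = int(snapshot_part)
        start_str, sep, end_str = field_part.partition("-")
        start = int(start_str)
        # Without a '-' the entry covers a single field id.
        end = int(end_str) if sep else start
        if end < start:
            raise ValueError(f"invalid field id range: {field_part!r}")
        # The range end is inclusive.
        for field_id in range(start, end + 1):
            result[field_id] = snapshot_id
    return result
```

With this sketch, "1-3:1000,5:2000" maps field ids 1, 2 and 3 to snapshot
1000 and field id 5 to snapshot 2000.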

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.
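
The effect of two partial runs can be sketched as follows, again in
Python and only as an illustration. The function serialize_snapshot_ids,
the field ids and the snapshot ids are hypothetical, and collapsing
adjacent field ids with the same snapshot id into a range is an
assumption about how the property could be written, not a description of
the actual code.

```python
def serialize_snapshot_ids(ids: dict[int, int]) -> str:
    """Serialise a field id -> snapshot id map, merging contiguous runs."""
    entries = []
    run = None  # (first field id, last field id, snapshot id) of current run

    def flush(r):
        first, last, snapshot_id = r
        fields = str(first) if first == last else f"{first}-{last}"
        entries.append(f"{fields}:{snapshot_id}")

    for field_id in sorted(ids):
        snapshot_id = ids[field_id]
        if run and field_id == run[1] + 1 and snapshot_id == run[2]:
            # Extend the current run by one field id.
            run = (run[0], field_id, snapshot_id)
        else:
            if run:
                flush(run)
            run = (field_id, field_id, snapshot_id)
    if run:
        flush(run)
    return ",".join(entries)


# First run: stats for fields 1-4 computed on snapshot 1000.
stats = {field_id: 1000 for field_id in range(1, 5)}
# Second run: only fields 2 and 3, computed on snapshot 2000.
stats.update({2: 2000, 3: 2000})
merged = serialize_snapshot_ids(stats)
```

Here merged is "1:1000,2-3:2000,4:1000": fields 1 and 4 still carry the
older snapshot id, so their stats are less recent than those of fields 2
and 3.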

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.

Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - Added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/catalog-op-executor.h
M be/src/service/client-request-state.cc
M common/thrift/JniCatalog.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/util/IcebergUtil.java
M fe/src/test/java/org/apache/impala/util/IcebergUtilTest.java
M testdata/workloads/functional-query/queries/QueryTest/iceberg-compute-stats.test
10 files changed, 327 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/22339/19
--
To view, visit http://gerrit.cloudera.org:8080/22339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Gerrit-Change-Number: 22339
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
