This is an automated email from the ASF dual-hosted git repository.
jshao pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/main by this push:
     new 3532ed965d [#10357] docs(table-maintenance-service): improve optimizer docs and architecture workflow (#10356)
3532ed965d is described below
commit 3532ed965df04db419ce057e2aa904b2d087c3d7
Author: FANNG <[email protected]>
AuthorDate: Wed Mar 11 15:02:22 2026 +0900
[#10357] docs(table-maintenance-service): improve optimizer docs and architecture workflow (#10356)
### What changes were proposed in this pull request?
This PR improves optimizer documentation under
`docs/table-maintenance-service/` and adds an architecture diagram
asset.
Main updates:
- Improved quick start prerequisites and end-to-end verification flow.
- Clarified optimizer configuration notes, especially status polling
interval behavior in local verification.
- Improved CLI reference around metrics naming and JDBC metrics driver
classpath requirements.
- Expanded troubleshooting for common runtime and configuration issues.
- Added explicit alpha-stage scope/limitations and extensibility
direction in optimizer overview.
- Added optimizer architecture workflow image and linked it from the
optimizer overview doc.
### Why are the changes needed?
Users can get blocked when following optimizer docs in real environments
because some prerequisites, sequencing, and verification checkpoints
were not explicit enough.
This PR makes the documentation more executable and reproducible, and
reduces setup/troubleshooting friction for the end-to-end flow.
Fix: #10357
### Does this PR introduce _any_ user-facing change?
Yes, documentation-only user-facing changes:
- Updated optimizer docs content and workflow guidance.
- Added one new docs asset image:
  - `docs/assets/table-maintenance-service/optimizer-architecture-workflow.png`
No user-facing API change and no runtime property key addition/removal in code.
### How was this patch tested?
- Verified markdown links and file paths in updated docs.
- Confirmed the new architecture image is referenced correctly from:
- `docs/table-maintenance-service/optimizer.md`
- Verified this is a docs-only change set (no runtime code changes).
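
The link/path verification described above can be approximated with a small script. This is an illustrative sketch only (not the project's actual release tooling): it checks that every markdown image reference in a doc resolves to a real file. All paths below are synthetic stand-ins.

```python
import pathlib
import re
import tempfile

def missing_image_refs(md_file: pathlib.Path) -> list:
    """Return image reference paths in a markdown file that do not exist on disk."""
    text = md_file.read_text()
    refs = re.findall(r"!\[[^\]]*\]\(([^)]+)\)", text)
    return [ref for ref in refs if not (md_file.parent / ref).exists()]

# Synthetic docs tree: one resolvable image, one broken reference.
root = pathlib.Path(tempfile.mkdtemp())
(root / "assets").mkdir()
(root / "assets" / "workflow.png").write_bytes(b"\x89PNG")
doc = root / "optimizer.md"
doc.write_text("![workflow](assets/workflow.png)\n![gone](assets/gone.png)\n")
print(missing_image_refs(doc))  # ['assets/gone.png']
```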
---
.../optimizer-architecture-workflow.png | Bin 0 -> 220473 bytes
.../optimizer-cli-reference.md | 7 +++
.../optimizer-configuration.md | 3 +-
.../optimizer-quick-start.md | 15 +++---
.../optimizer-troubleshooting.md | 57 ++++++++++++++++++++-
docs/table-maintenance-service/optimizer.md | 29 +++++++++++
6 files changed, 103 insertions(+), 8 deletions(-)
diff --git a/docs/assets/table-maintenance-service/optimizer-architecture-workflow.png b/docs/assets/table-maintenance-service/optimizer-architecture-workflow.png
new file mode 100644
index 0000000000..8b60984c6e
Binary files /dev/null and b/docs/assets/table-maintenance-service/optimizer-architecture-workflow.png differ
diff --git a/docs/table-maintenance-service/optimizer-cli-reference.md b/docs/table-maintenance-service/optimizer-cli-reference.md
index 6df4e44696..7d5213b38f 100644
--- a/docs/table-maintenance-service/optimizer-cli-reference.md
+++ b/docs/table-maintenance-service/optimizer-cli-reference.md
@@ -150,6 +150,10 @@ Rule format is `scope:metricName:aggregation:comparison`:
- `aggregation`: `max|min|avg|latest`
- `comparison`: `lt|le|gt|ge|eq|ne`
+When metrics are produced by `submit-update-stats-job --update-mode metrics`, metric names are
+often `custom-*` (for example `custom-data-file-mse`). Use `list-table-metrics` first and
+configure rules with the exact metric names returned by your environment.
+
### Submit built-in update stats jobs
Submit built-in Iceberg update stats/metrics Spark jobs directly.
@@ -168,6 +172,9 @@ Notes:
- `--identifiers` supports `catalog.schema.table` or `schema.table` (when default catalog is configured).
- `--update-mode` supports `stats|metrics|all` (default `all`).
- For `stats` or `all`, `--updater-options` must include `gravitino_uri` and `metalake`.
+- If `--updater-options` includes external JDBC metrics settings
+  (`gravitino.optimizer.jdbcMetrics.*`), ensure the JDBC driver JAR is available to Spark
+  runtime classpath (for example via `spark.jars` in `--spark-conf`).
- `--spark-conf` and `--updater-options` are flat JSON maps.
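
The "flat JSON maps" requirement above can be sketched as follows. This is a hypothetical helper (not part of the optimizer CLI) that assumes a flat map means a single-level object of string keys to string values; the option values shown are examples only.

```python
import json

def flat_json_map(options: dict) -> str:
    """Serialize options as a flat JSON map, rejecting nested or non-string values."""
    for key, value in options.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("flat JSON map: string keys and string values only")
    return json.dumps(options)

# Example --spark-conf payload; the JAR path is a placeholder.
spark_conf = flat_json_map({
    "spark.jars": "/path/to/postgresql-42.7.4.jar",   # JDBC driver JAR for jdbcMetrics.*
    "spark.hadoop.fs.defaultFS": "file:///",
})
print(spark_conf)
```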
### List table metrics
diff --git a/docs/table-maintenance-service/optimizer-configuration.md b/docs/table-maintenance-service/optimizer-configuration.md
index bb67584452..5ee0ac9d1f 100644
--- a/docs/table-maintenance-service/optimizer-configuration.md
+++ b/docs/table-maintenance-service/optimizer-configuration.md
@@ -25,7 +25,8 @@ gravitino.job.statusPullIntervalInMs=300000
gravitino.jobExecutor.local.sparkHome=/path/to/spark
```
-For local demo environments, you can reduce `gravitino.job.statusPullIntervalInMs` to get faster status updates.
+For local demo environments, you can reduce `gravitino.job.statusPullIntervalInMs` (for example
+`10000`) to get faster status updates. Restart Gravitino after changing this value.
## Built-in update stats `jobConf`
diff --git a/docs/table-maintenance-service/optimizer-quick-start.md b/docs/table-maintenance-service/optimizer-quick-start.md
--- a/docs/table-maintenance-service/optimizer-quick-start.md
+++ b/docs/table-maintenance-service/optimizer-quick-start.md
@@ -10,6 +10,8 @@ license: This software is licensed under the Apache License version 2.
- Prepare a running Gravitino server.
- Ensure target metalake exists (examples use `test`).
- Configure `SPARK_HOME` or `gravitino.jobExecutor.local.sparkHome` for Spark templates.
+- For faster status feedback during verification, set `gravitino.job.statusPullIntervalInMs`
+  to a smaller value (for example `10000`) and restart Gravitino.
- If your Iceberg REST backend is in-memory, avoid restarting it during this quick start because
  restart resets metadata and data files.
@@ -17,7 +19,8 @@ For full config details, see [Optimizer Configuration](./optimizer-configuration
## Success criteria
-- Update-stats job finishes and statistics include `custom-data-file-mse` and `custom-delete-file-number`.
+- Update-stats job finishes and table statistics/metrics include `custom-data-file-mse` and
+  `custom-delete-file-number`.
- `submit-strategy-jobs` prints `SUBMIT` with a rewrite job ID.
- Rewrite job log shows `Rewritten data files: <N>` where `N > 0` for non-empty tables.
@@ -98,12 +101,12 @@ Use Spark SQL to create enough small files so compaction has visible effect:
```bash
${SPARK_HOME}/bin/spark-sql \
--conf spark.hadoop.fs.defaultFS=file:/// \
- --conf spark.sql.catalog.rest_demo=org.apache.iceberg.spark.SparkCatalog \
- --conf spark.sql.catalog.rest_demo.type=rest \
- --conf spark.sql.catalog.rest_demo.uri=http://localhost:9001/iceberg \
- -e "CREATE NAMESPACE IF NOT EXISTS rest_demo.db; \
+ --conf spark.sql.catalog.rest_catalog=org.apache.iceberg.spark.SparkCatalog \
+ --conf spark.sql.catalog.rest_catalog.type=rest \
+ --conf spark.sql.catalog.rest_catalog.uri=http://localhost:9001/iceberg \
+ -e "CREATE NAMESPACE IF NOT EXISTS rest_catalog.db; \
SET spark.sql.files.maxRecordsPerFile=1000; \
- INSERT INTO rest_demo.db.t1 \
+ INSERT INTO rest_catalog.db.t1 \
SELECT id, concat('name_', CAST(id AS STRING)) FROM range(0, 100000);"
```
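
Why this INSERT yields many small files can be checked with back-of-envelope arithmetic: with `spark.sql.files.maxRecordsPerFile=1000`, the 100,000 inserted rows are split into files of at most 1,000 records each, so at least 100 data files are written (assuming a single write task; more tasks only increase the file count).

```python
import math

rows = 100_000                # rows produced by range(0, 100000)
max_records_per_file = 1_000  # spark.sql.files.maxRecordsPerFile
min_files = math.ceil(rows / max_records_per_file)
print(min_files)  # 100
```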
diff --git a/docs/table-maintenance-service/optimizer-troubleshooting.md b/docs/table-maintenance-service/optimizer-troubleshooting.md
index bb9879da9d..7901e69d71 100644
--- a/docs/table-maintenance-service/optimizer-troubleshooting.md
+++ b/docs/table-maintenance-service/optimizer-troubleshooting.md
@@ -31,13 +31,27 @@ Check `gravitino.job.statusPullIntervalInMs` and local staging logs under:
`/tmp/gravitino/jobs/staging/<metalake>/<job-template-name>/<job-id>/error.log`.
+For local verification, reduce `gravitino.job.statusPullIntervalInMs` (for example `10000`) and
+restart Gravitino so REST status can refresh faster.
+
## `No identifiers matched strategy name ...`
`--strategy-name` must be the policy name (for example `iceberg_compaction_default`), not the policy
type (`system_iceberg_compaction`) and not the strategy type (`iceberg-data-compaction`).
## Dry-run returns no `DRY-RUN` or `SUBMIT` lines
-This usually means trigger conditions are not met. For compaction, verify `custom-data-file-mse` and `custom-delete-file-number` in table statistics are large enough to satisfy policy rules.
+This usually means trigger conditions are not met. For compaction, verify
+`custom-data-file-mse` and `custom-delete-file-number` in table statistics/metrics are large
+enough to satisfy policy rules.
+
+## `monitor-metrics` returns `evaluation=false` unexpectedly
+
+Check both rule names and metric samples:
+
+1. Query current metrics first with `list-table-metrics` (and `--partition-path` for partition scope).
+2. Use the exact metric names returned by your environment in
+ `gravitino.optimizer.monitor.gravitinoMetricsEvaluator.rules`.
+3. Ensure `--action-time` is inside the range where both before and after samples exist.
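
The rule strings referenced in step 2 follow the format documented in the CLI reference (`scope:metricName:aggregation:comparison`, with aggregation in `max|min|avg|latest` and comparison in `lt|le|gt|ge|eq|ne`). A hypothetical validator (not part of the optimizer CLI; the `table` scope value is an example, not a documented constant) can catch malformed rules before they silently evaluate to `false`:

```python
AGGREGATIONS = {"max", "min", "avg", "latest"}
COMPARISONS = {"lt", "le", "gt", "ge", "eq", "ne"}

def parse_rule(rule: str) -> dict:
    """Split and validate a scope:metricName:aggregation:comparison rule string."""
    parts = rule.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected scope:metricName:aggregation:comparison, got {rule!r}")
    scope, metric, aggregation, comparison = parts
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unknown aggregation {aggregation!r}")
    if comparison not in COMPARISONS:
        raise ValueError(f"unknown comparison {comparison!r}")
    return {"scope": scope, "metricName": metric,
            "aggregation": aggregation, "comparison": comparison}

print(parse_rule("table:custom-data-file-mse:latest:gt"))
```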
## `No StrategyHandler class configured for strategy type ...`
@@ -57,6 +71,47 @@ Set local filesystem explicitly in Spark config:
spark.hadoop.fs.defaultFS=file:///
```
+## Rewrite fails on multi-level partition (`identity + day(...)`)
+
+In release `1.2.0`, rewrite may fail for partition filters combining identity and day transform
+(for example `PARTITIONED BY (p, days(ts))`) with error:
+
+```text
+Cannot translate Spark expression ... day(cast(ts as date)) ... to data source filter
+```
+
+How to verify:
+
+1. Check job run status by rewrite job id under
+ `/api/metalakes/<metalake>/jobs/runs/<job-id>`.
+2. Check staging log:
+
`/tmp/gravitino/jobs/staging/<metalake>/builtin-iceberg-rewrite-data-files/<job-id>/error.log`.
+
+Workaround:
+
+- Use identity-only partition compaction path for release `1.2.0`.
+- Keep this failure case as a reproducible regression test for later fix validation.
+
+Observed compatibility matrix in release `1.2.0` (rewrite path):
+
+- PASS: `p`, `p, c2` (identity-only partition transforms)
+- FAIL: `p, years(ts)`, `p, months(ts)`, `p, days(ts)`, `p, hours(ts)`,
+ `p, truncate(1, c2)`, `p, bucket(8, id)`
+
+## `submit-update-stats-job` fails with JDBC metrics errors
+
+When `--updater-options` includes `gravitino.optimizer.jdbcMetrics.*`, ensure the JDBC driver is
+available to Spark runtime classpath. Typical failures include `ClassNotFoundException` for driver
+class or `No suitable driver`.
+
+Example in `--spark-conf`:
+
+```json
+{
+ "spark.jars": "/path/to/postgresql-42.7.4.jar"
+}
+```
+
## `Specified optimizer config file does not exist`
Check your `--conf-path` and file permissions.
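
The check implied here can be sketched generically (this is not project code): the file passed via `--conf-path` must exist and be readable by the user running the CLI. The paths below are placeholders.

```python
import os
import tempfile

def conf_path_ok(path: str) -> bool:
    """True if the config file exists and is readable by the current user."""
    return os.path.isfile(path) and os.access(path, os.R_OK)

# A readable temp file stands in for a real optimizer config.
fd, conf = tempfile.mkstemp(suffix=".properties")
os.close(fd)
print(conf_path_ok(conf), conf_path_ok("/no/such/optimizer.conf"))  # True False
```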
diff --git a/docs/table-maintenance-service/optimizer.md b/docs/table-maintenance-service/optimizer.md
index ed19ead78f..d7c6cda38f 100644
--- a/docs/table-maintenance-service/optimizer.md
+++ b/docs/table-maintenance-service/optimizer.md
@@ -15,6 +15,30 @@ The Table Maintenance Service (Optimizer) automates table maintenance by connect
The CLI commands and configuration keys use the `optimizer` name.
+## Alpha status and current limitations
+
+The current Table Maintenance Service is in **alpha** stage.
+
+Current limitations:
+
+- It is operated through the optimizer CLI workflow.
+- The built-in maintenance strategy focuses on Iceberg table compaction.
+- Compaction support is currently limited to Iceberg tables with identity partition transforms.
+
+## Extensibility and roadmap
+
+Although the built-in capability is intentionally narrow in alpha, the framework is designed for
+extension:
+
+- Integrate external systems by implementing custom providers and adapters.
+- Add new strategies and handlers beyond built-in compaction.
+- Plug in custom metrics, evaluators, and job submitters for different environments.
+
+See [Optimizer Extension Guide](./optimizer-extension-guide.md) for extension points.
+
+Future versions will continue improving the out-of-the-box experience and evolve toward a more
+ready-to-use maintenance service.
+
## Architecture overview
The optimizer workflow is based on six parts:
@@ -26,6 +50,11 @@ The optimizer workflow is based on six parts:
5. Job executor: local or custom backend that runs submitted jobs.
6. Status and logs: REST job state plus local staging logs.
+
+
+The following diagram shows the end-to-end interactions between CLI, Gravitino server, Spark jobs,
+JDBC metrics repository, and the Recommender/Updater/Monitor modules.
+
Typical data flow:
1. Collect statistics and metrics for target tables.