nsivabalan opened a new pull request, #18421:
URL: https://github.com/apache/hudi/pull/18421
### Describe the issue this Pull Request addresses
This PR introduces rolling extra metadata support to automatically carry
forward configured metadata keys across commits. This solves the problem of
tracking checkpoint information (e.g., Kafka offsets, Flink checkpoint IDs) in
streaming ingestion scenarios where:
- Users need to access checkpoint metadata from the latest commit without
walking back the timeline
- Timeline archival can remove old commits containing important checkpoint
information
- Table services (compaction, clustering) need to preserve checkpoint
metadata automatically
Motivation: In streaming ingestion pipelines, checkpoint information is
critical for exactly-once processing and failure recovery. Currently, users
must manually track and pass this metadata with every commit, or walk back the
timeline to find the latest checkpoint. This PR automates this process by
allowing users to configure specific metadata keys that are automatically
carried forward to every subsequent commit.
### Summary and Changelog
Summary:
Users can now configure specific extra metadata keys (e.g.,
checkpoint.offset, checkpoint.partition) that Hudi will automatically carry
forward across all commits. When a new commit is created, Hudi checks recent
commits for these configured keys and merges them into the current commit's
metadata. This ensures checkpoint information is always available in the latest
commit without manual intervention.
Detailed Changelog:
Configuration Changes:
- Added hoodie.write.rolling.metadata.keys (advanced, default: empty) -
Comma-separated list of metadata keys to automatically roll forward
- Added hoodie.write.rolling.metadata.timeline.lookback.commits (advanced,
default: 10) - Maximum number of recent commits to search when looking for
missing rolling metadata keys
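To illustrate how the two properties might be supplied and read, here is a minimal sketch using plain `java.util.Properties`. The property names and the default of 10 come from this PR. The `RollingMetadataConfigExample` class and its `parseKeys` and `lookback` helpers are hypothetical and are not the actual `HoodieWriteConfig` code. The trimming and empty-entry filtering is an assumption based on the "Empty/whitespace key filtering" test case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class RollingMetadataConfigExample {
    // Property names as introduced by this PR.
    static final String KEYS = "hoodie.write.rolling.metadata.keys";
    static final String LOOKBACK = "hoodie.write.rolling.metadata.timeline.lookback.commits";

    // Hypothetical helper: split the comma-separated value, trim each entry,
    // and drop entries that are empty or whitespace-only.
    static List<String> parseKeys(Properties props) {
        String raw = props.getProperty(KEYS, "");
        return Arrays.stream(raw.split(","))
            .map(String::trim)
            .filter(key -> !key.isEmpty())
            .collect(Collectors.toList());
    }

    // Hypothetical helper: read the lookback limit, falling back to the
    // documented default of 10 when the property is not set.
    static int lookback(Properties props) {
        return Integer.parseInt(props.getProperty(LOOKBACK, "10"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEYS, "checkpoint.offset, checkpoint.partition, ,");
        System.out.println(parseKeys(props)); // [checkpoint.offset, checkpoint.partition]
        System.out.println(lookback(props));  // 10
    }
}
```

With the keys property left unset, `parseKeys` returns an empty list, which corresponds to the feature being disabled by default.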
Implementation Changes:
- Added BaseHoodieWriteClient.mergeRollingMetadata() method that:
  - Executes within the transaction lock after write conflict resolution
    in preCommit()
  - Walks back timeline in reverse order (most recent first) to find
    latest values for configured keys
  - Merges found values into current commit's extra metadata
  - Uses fresh timeline view (either from createTable() or reloaded during
    conflict resolution)
  - Skips metadata table (applies only to data tables)
  - Stops early once all keys are found (performance optimization)
- Errors in rolling metadata merge do not fail the commit (non-blocking)
- Added getter methods getRollingMetadataKeys() and
getRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig
- Added builder methods withRollingMetadataKeys() and
withRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig.Builder
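The walk-back and merge described above can be sketched roughly as follows. This is a simplified model, not the actual `mergeRollingMetadata()` implementation: the timeline is represented as a plain list of per-commit extra-metadata maps ordered oldest to newest, and the `RollingMetadataSketch` class and its method signature are assumptions made for illustration. Locking, timeline reloading, the metadata-table check, and error handling are omitted.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class RollingMetadataSketch {
    // Simplified model (assumption): completedCommits holds the extra metadata
    // of each completed commit, ordered oldest to newest.
    static Map<String, String> mergeRollingMetadata(
            Map<String, String> current,
            List<Map<String, String>> completedCommits,
            Set<String> rollingKeys,
            int lookbackCommits) {
        // Values already present in the current commit always win.
        Map<String, String> merged = new HashMap<>(current);
        Set<String> missing = new LinkedHashSet<>(rollingKeys);
        missing.removeAll(current.keySet());

        // Walk back most recent first; stop at the lookback limit
        // or as soon as every configured key has been found.
        int visited = 0;
        for (int i = completedCommits.size() - 1;
             i >= 0 && visited < lookbackCommits && !missing.isEmpty();
             i--, visited++) {
            Map<String, String> older = completedCommits.get(i);
            missing.removeIf(key -> {
                String value = older.get(key);
                if (value == null) {
                    return false; // still missing, keep searching older commits
                }
                merged.put(key, value);
                return true;
            });
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> timeline = List.of(
            Map.of("checkpoint.offset", "100", "checkpoint.partition", "0"),
            Map.of("checkpoint.offset", "200"),
            Map.<String, String>of()); // e.g. a commit that carried no checkpoint keys
        Set<String> keys = Set.of("checkpoint.offset", "checkpoint.partition");
        Map<String, String> current = Map.of("checkpoint.offset", "300");

        // Lookback of 10 reaches the oldest commit, so the partition is recovered.
        System.out.println(new TreeMap<>(mergeRollingMetadata(current, timeline, keys, 10)));
        // Lookback of 2 stops before the oldest commit, so the partition stays missing.
        System.out.println(new TreeMap<>(mergeRollingMetadata(current, timeline, keys, 2)));
    }
}
```

In this model the first call yields `checkpoint.offset=300` and `checkpoint.partition=0`, while the second yields only `checkpoint.offset=300`, which shows how the lookback limit bounds what can be carried forward.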
Testing Changes:
- Added comprehensive test suite TestRollingMetadata with 8 test cases
covering:
  - Basic carry-forward behavior across multiple commits
  - Override semantics (new values take precedence over old)
  - Timeline walkback to find keys from older commits
  - Walkback limit enforcement
  - Behavior with no configured keys (default)
  - First commit in empty table (no rolling metadata available)
  - CoW and MoR table types
  - Empty/whitespace key filtering
Semantics:
- Current commit values override old values (latest wins)
- Missing keys are searched in recent commits (reverse chronological order)
- Applies to commit, deltacommit, and replacecommit action types
- Works for both CoW and MoR tables
### Impact
Public API Changes:
- New writer configuration properties:
  - hoodie.write.rolling.metadata.keys
  - hoodie.write.rolling.metadata.timeline.lookback.commits
- New builder methods in HoodieWriteConfig.Builder:
  - withRollingMetadataKeys(String keys)
  - withRollingMetadataTimelineLookbackCommits(int lookbackCommits)
User-Facing Changes:
- Opt-in feature: Users can configure specific metadata keys to
automatically roll forward across commits
- Checkpoint information is automatically preserved across table services
(compaction, clustering, etc.)
- Latest commit contains each configured rolling metadata key, provided the
key is set on the current commit or appears in a commit within the lookback
window
Performance Impact:
- Minimal overhead: Only walks back timeline when keys are missing from
current commit
- Early termination: Stops walking back once all keys are found
- Default lookback limit of 10 commits balances resilience vs. performance
- Reverse iteration checks most recent commits first, so the latest value of
each key is found after reading as few commits as possible
Breaking Changes:
None. This is an opt-in feature with default empty configuration
maintaining existing behavior.
### Risk Level
low
### Documentation Update
Config Documentation:
Both new configurations include comprehensive inline documentation
explaining:
- Purpose and use cases
- Default values and behavior
- Performance implications
- Interaction with other features
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable