nsivabalan opened a new pull request, #18421:
URL: https://github.com/apache/hudi/pull/18421

   ### Describe the issue this Pull Request addresses
   
   This PR introduces rolling extra metadata support to automatically carry 
forward configured metadata keys across commits. This solves the problem of 
tracking checkpoint information (e.g., Kafka offsets, Flink checkpoint IDs) in 
streaming ingestion scenarios where:
     - Users need to access checkpoint metadata from the latest commit without 
walking back the timeline
     - Timeline archival can remove old commits containing important checkpoint 
information
     - Table services (compaction, clustering) need to preserve checkpoint 
metadata automatically
   
     Motivation: In streaming ingestion pipelines, checkpoint information is 
critical for exactly-once processing and failure recovery. Currently, users 
must manually track and pass this metadata with every commit, or walk back the 
timeline to find the latest checkpoint. This PR automates this process by 
allowing users to configure specific metadata keys that are automatically 
carried forward to every subsequent commit.
   
   ### Summary and Changelog
   
   Summary:
   Users can now configure specific extra metadata keys (e.g., 
checkpoint.offset, checkpoint.partition) that Hudi will automatically carry 
forward across all commits. When a new commit is created, Hudi checks recent 
commits for these configured keys and merges them into the current commit's 
metadata. This ensures checkpoint information is always available in the latest 
commit without manual intervention.
   
     Detailed Changelog:
   
     Configuration Changes:
     - Added hoodie.write.rolling.metadata.keys (advanced, default: empty) - 
Comma-separated list of metadata keys to automatically roll forward
     - Added hoodie.write.rolling.metadata.timeline.lookback.commits (advanced, 
default: 10) - Maximum number of recent commits to search when looking for 
missing rolling metadata keys
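
   As an illustration of the two properties above, here is a minimal sketch
   that sets them on a plain java.util.Properties object and parses the key
   list. The property names and the default of 10 come from this PR; the class
   RollingMetadataConfigSketch and its helpers are hypothetical and do not
   reflect how HoodieWriteConfig is implemented. Dropping blank entries mirrors
   the empty/whitespace key filtering listed under Testing Changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Hudi: illustrates the two new properties.
final class RollingMetadataConfigSketch {
  static final String KEYS_PROP = "hoodie.write.rolling.metadata.keys";
  static final String LOOKBACK_PROP =
      "hoodie.write.rolling.metadata.timeline.lookback.commits";

  // Splits the comma-separated value, trimming entries and dropping blanks.
  static List<String> parseKeys(Properties props) {
    List<String> keys = new ArrayList<>();
    for (String raw : props.getProperty(KEYS_PROP, "").split(",")) {
      String key = raw.trim();
      if (!key.isEmpty()) {
        keys.add(key);
      }
    }
    return keys;
  }

  // Falls back to the default of 10 stated in this PR when the property is unset.
  static int lookbackCommits(Properties props) {
    return Integer.parseInt(props.getProperty(LOOKBACK_PROP, "10").trim());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(KEYS_PROP, "checkpoint.offset, ,checkpoint.partition");
    System.out.println(parseKeys(props));       // the two non-blank keys
    System.out.println(lookbackCommits(props)); // default lookback
  }
}
```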
   
     Implementation Changes:
     - Added BaseHoodieWriteClient.mergeRollingMetadata() method that:
       - Executes within the transaction lock after write conflict resolution 
in preCommit()
       - Walks back the timeline in reverse order (most recent first) to find 
the latest values for the configured keys
       - Merges the found values into the current commit's extra metadata
       - Uses a fresh timeline view (either from createTable() or reloaded 
during conflict resolution)
       - Skips the metadata table (applies only to data tables)
       - Stops early once all keys are found (performance optimization)
       - Does not fail the commit if the rolling metadata merge hits an error 
(non-blocking)
     - Added getter methods getRollingMetadataKeys() and 
getRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig
     - Added builder methods withRollingMetadataKeys() and 
withRollingMetadataTimelineLookbackCommits() in HoodieWriteConfig.Builder
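
   The merge steps listed above can be sketched with a simplified model,
   assuming each completed commit is reduced to its extra-metadata map and the
   timeline is a list ordered oldest to newest. RollingMetadataMergeSketch and
   its merge() signature are hypothetical; this is not the code of
   BaseHoodieWriteClient.mergeRollingMetadata(), and locking, timeline
   reloading, and error handling are left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model, not Hudi code: a commit is just its extra-metadata map,
// and completedCommits is ordered oldest to newest.
final class RollingMetadataMergeSketch {

  static Map<String, String> merge(Map<String, String> current,
                                   List<Map<String, String>> completedCommits,
                                   Set<String> rollingKeys,
                                   int lookbackCommits) {
    Map<String, String> merged = new HashMap<>(current);

    // Keys already present in the current commit are never overwritten.
    Set<String> missing = new HashSet<>(rollingKeys);
    missing.removeAll(current.keySet());

    // Scan at most lookbackCommits entries, newest first, stopping early
    // as soon as every configured key has a value.
    int oldest = Math.max(0, completedCommits.size() - lookbackCommits);
    for (int i = completedCommits.size() - 1; i >= oldest && !missing.isEmpty(); i--) {
      Map<String, String> past = completedCommits.get(i);
      missing.removeIf(key -> {
        String value = past.get(key);
        if (value == null) {
          return false; // keep looking in older commits
        }
        merged.put(key, value);
        return true;
      });
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Map<String, String>> commits = List.of(
        Map.of("checkpoint.offset", "100"),
        Map.of("checkpoint.partition", "3"));
    Map<String, String> current = Map.of("checkpoint.partition", "4");
    Set<String> keys = Set.of("checkpoint.offset", "checkpoint.partition");
    System.out.println(merge(current, commits, keys, 10));
  }
}
```

   In the demo, checkpoint.offset is picked up from the oldest commit while the
   current commit's own checkpoint.partition value is kept. With a lookback of
   1, only the newest completed commit would be scanned and checkpoint.offset
   would stay absent.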
   
     Testing Changes:
     - Added comprehensive test suite TestRollingMetadata with 8 test cases 
covering:
       - Basic carry-forward behavior across multiple commits
       - Override semantics (new values take precedence over old)
       - Timeline walkback to find keys from older commits
       - Walkback limit enforcement
       - Behavior with no configured keys (default)
       - First commit in empty table (no rolling metadata available)
       - CoW and MoR table types
       - Empty/whitespace key filtering
   
     Semantics:
     - Current commit values override old values (latest wins)
     - Missing keys are searched in recent commits (reverse chronological order)
     - Applies to commit, deltacommit, and replacecommit action types
     - Works for both CoW and MoR tables
   
   ### Impact
   
     Public API Changes:
     - New writer configuration properties:
       - hoodie.write.rolling.metadata.keys
       - hoodie.write.rolling.metadata.timeline.lookback.commits
     - New builder methods in HoodieWriteConfig.Builder:
       - withRollingMetadataKeys(String keys)
       - withRollingMetadataTimelineLookbackCommits(int lookbackCommits)
   
     User-Facing Changes:
     - Opt-in feature: Users can configure specific metadata keys to 
automatically roll forward across commits
     - Checkpoint information is automatically preserved across table services 
(compaction, clustering, etc.)
     - The latest commit carries every configured rolling metadata key found 
within the lookback window
   
     Performance Impact:
     - Minimal overhead: Walks back the timeline only when configured keys are 
missing from the current commit
     - Early termination: Stops walking back once all keys are found
     - Default lookback limit of 10 commits balances resilience vs. performance
     - Reverse iteration checks the most recent commits first, where the latest 
values are most likely to be found
   
     Breaking Changes:
     None. The feature is opt-in, and the default empty configuration 
preserves existing behavior.
   
   ### Risk Level
   
   low
   
   ### Documentation Update
   
    Config Documentation:
     Both new configurations include comprehensive inline documentation 
explaining:
     - Purpose and use cases
     - Default values and behavior
     - Performance implications
     - Interaction with other features
     
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

