shangxinli opened a new pull request, #13674:
URL: https://github.com/apache/hudi/pull/13674
### Change Logs
This change introduces a file stitching optimization for Hudi clustering
that merges row groups, based on schema compatibility, using the Parquet
API. The implementation adds HoodieParquetStrictMerge for efficient file
merging and LiteFileBinaryCopier for optimized file copying, and updates
PartitionAwareClusteringPlanStrategy to support row-group-level merging. A
new configuration, PARQUET_LITE_FILE_MERGER_ENABLE, controls this feature.
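
To illustrate the schema-compatibility idea, here is a minimal sketch, not
the code in this PR. It assumes "compatible" means the schema strings are
exactly equal, and every name in it (SchemaGroupingSketch, FileFooter,
groupBySchema) is hypothetical. Each resulting group is a set of files whose
row groups could be stitched together without rewriting records.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the PR.
final class SchemaGroupingSketch {

    /** Minimal stand-in for a Parquet footer: a file path plus its schema as a string. */
    static final class FileFooter {
        final String path;
        final String schema;

        FileFooter(String path, String schema) {
            this.path = path;
            this.schema = schema;
        }
    }

    /**
     * Groups file paths by exact schema equality, preserving first-seen order.
     * Assumption: strict equality is the compatibility rule; the real check may differ.
     */
    static Map<String, List<String>> groupBySchema(List<FileFooter> footers) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (FileFooter footer : footers) {
            groups.computeIfAbsent(footer.schema, k -> new ArrayList<>()).add(footer.path);
        }
        return groups;
    }
}
```

Files that end up alone in a group would presumably fall back to the
existing clustering path, since there is nothing to stitch them with.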
### Impact
- New configuration: hoodie.storage.parquet.lite.file.merger.enable
(default: false)
- Enhanced PartitionAwareClusteringPlanStrategy with row group merging
capabilities
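
As a rough sketch of opting in, the snippet below sets the new key on a
plain java.util.Properties object. The key and its default of false come from
this PR; the helper class and method names are hypothetical, and how the
properties reach the Hudi write config is left out.

```java
import java.util.Properties;

// Hypothetical helper; only the config key and its default come from this PR.
final class LiteFileMergerConfigSketch {

    static final String LITE_FILE_MERGER_ENABLE_KEY =
        "hoodie.storage.parquet.lite.file.merger.enable";

    /** Returns a copy of the given properties with the lite file merger switched on. */
    static Properties withLiteFileMergerEnabled(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty(LITE_FILE_MERGER_ENABLE_KEY, "true");
        return props;
    }

    /** Reads the flag, falling back to the documented default of false. */
    static boolean isLiteFileMergerEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(LITE_FILE_MERGER_ENABLE_KEY, "false"));
    }
}
```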
### Risk level (medium)
Verification done to mitigate risks:
- Added unit tests in HoodieParquetStrictMergeTest and
TestClusteringLiteFileMerger
- Integration tests for partition-aware clustering strategy
- Feature is disabled by default and requires explicit configuration
- Maintains backward compatibility with existing clustering behavior
### Documentation Update
Required updates:
- Configuration documentation needs an update for the new
hoodie.storage.parquet.lite.file.merger.enable config
- Clustering strategy documentation should describe the row group merging
optimization
- Performance tuning guide should mention this optimization for large file
scenarios
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed