nsivabalan opened a new pull request, #18353:
URL: https://github.com/apache/hudi/pull/18353
### Describe the issue this Pull Request addresses
For Record Level Index (RLI), the default of 10 or 1 file groups may not
match the actual table size. The metadata table and RLI are initialized
**before the first commit completes**, so Hudi cannot assess the actual
table size to determine an appropriate number of file groups. This
results in:
- **Large bootstrap scenarios**: The hardcoded default may be too small for
tables initialized with large data loads
### Summary and Changelog
**What users gain:**
- **Optimized RLI file group allocation**: For fresh tables, RLI
initialization is now deferred until after the first commit, allowing
Hudi to programmatically determine the optimal number of file groups based
on actual data size
- **Better resource utilization**: Small tables will use fewer file groups
(as low as 1), while large bootstrap scenarios will allocate
more file groups appropriately within configured min/max bounds
- **Transparent optimization**: No user configuration changes needed - the
deferral happens automatically
**Detailed Changes:**
- **HoodieBackedTableMetadataWriter.java (lines 454-456)**: Added logic to
defer RLI initialization for fresh tables (tables with zero
completed instants) by removing `RECORD_INDEX` from
`enabledPartitionTypes` during the first commit
- **TestHoodieBackedMetadata.java**: Added two comprehensive tests:
  - `testPartitionedRecordIndexDeferredInitializationForFreshTable`:
    Validates RLI is NOT initialized on 1st commit but IS initialized
    on 2nd commit with file group count = 1 for small data (150 records)
  - `testPartitionedRecordIndexLargerDataFileGroupCount`: Validates that
    with larger data (7000 records), file group count is
    programmatically determined (> 1) based on the `estimateFileGroupCount`
    logic
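The deferral described above can be sketched roughly as follows. This is an
illustrative, simplified model and not the code in
`HoodieBackedTableMetadataWriter`: the class `DeferredRliSketch`, the
`PartitionType` enum, and the `partitionsToInitialize` helper are hypothetical
names introduced only for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: not Hudi's actual classes or method names.
class DeferredRliSketch {

  // Simplified stand-in for the metadata partition types.
  enum PartitionType { FILES, COLUMN_STATS, RECORD_INDEX }

  // Returns the partitions to initialize for this commit. On a fresh table
  // (zero completed instants) RECORD_INDEX is dropped, so its initialization
  // is deferred to a later commit. The input set is not modified.
  static Set<PartitionType> partitionsToInitialize(Set<PartitionType> enabled,
                                                   int completedInstants) {
    EnumSet<PartitionType> result = EnumSet.noneOf(PartitionType.class);
    result.addAll(enabled);
    if (completedInstants == 0) {
      result.remove(PartitionType.RECORD_INDEX);
    }
    return result;
  }

  public static void main(String[] args) {
    Set<PartitionType> enabled =
        EnumSet.of(PartitionType.FILES, PartitionType.RECORD_INDEX);
    // First commit on a fresh table: RLI is skipped.
    System.out.println(partitionsToInitialize(enabled, 0)); // [FILES]
    // Later commit: RLI is initialized as usual.
    System.out.println(partitionsToInitialize(enabled, 1)); // [FILES, RECORD_INDEX]
  }
}
```

Copying the set instead of mutating it is a choice made for this sketch only;
the PR itself describes removing `RECORD_INDEX` from `enabledPartitionTypes`.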
**How it works:**
1. On **first commit** (fresh table): RLI initialization is skipped even
if enabled in config
2. On **second commit**: RLI initialization proceeds normally, and
`estimateFileGroupCount()` uses the actual record count from the
first commit to determine file groups within the configured min/max bounds
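
To illustrate how a file group count can be derived from actual data size and
kept within min/max bounds, here is a minimal sketch. It is not Hudi's
`estimateFileGroupCount()` implementation: the method `sketchFileGroupCount`,
its parameters, and the sample values (48-byte records, 1 MiB file groups,
bounds of 1 and 10) are assumptions chosen only for this example.

```java
// Hypothetical sketch: not Hudi's estimateFileGroupCount() implementation.
class FileGroupCountSketch {

  // Estimates how many file groups are needed to hold the given records,
  // then clamps the result to [minFileGroups, maxFileGroups].
  static int sketchFileGroupCount(long recordCount,
                                  int avgRecordSizeBytes,
                                  long maxFileGroupSizeBytes,
                                  int minFileGroups,
                                  int maxFileGroups) {
    if (minFileGroups > maxFileGroups || maxFileGroupSizeBytes <= 0) {
      throw new IllegalArgumentException("invalid bounds or file group size");
    }
    long totalBytes = recordCount * avgRecordSizeBytes;
    // Ceiling division: number of file groups needed to fit totalBytes.
    long needed = (totalBytes + maxFileGroupSizeBytes - 1) / maxFileGroupSizeBytes;
    return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
  }

  public static void main(String[] args) {
    long oneMiB = 1024L * 1024L;
    // Small first commit: a single file group is enough.
    System.out.println(sketchFileGroupCount(150, 48, oneMiB, 1, 10));        // 1
    // Mid-sized first commit: about 4.6 MiB of index data.
    System.out.println(sketchFileGroupCount(100_000, 48, oneMiB, 1, 10));    // 5
    // Large bootstrap: the estimate is capped at the configured maximum.
    System.out.println(sketchFileGroupCount(10_000_000, 48, oneMiB, 1, 10)); // 10
  }
}
```

The thresholds at which Hudi itself moves from 1 to more file groups depend on
its own size estimates and configs, so the numbers above should not be read as
the behavior exercised by the tests in this PR.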
### Impact
**User-facing changes:**
- **Behavioral change (non-breaking)**: For new tables with RLI enabled,
the RLI partition will not be available after the first commit,
but will be available starting from the second commit
- **Performance improvement**: Reduced overhead for small tables, better
scaling for large bootstrap scenarios
- **No config changes needed**: Existing configurations continue to work;
the optimization is automatic
**Performance impact:**
- Small tables (< 1000 records): Expected reduction from 10 file groups to
1-2 file groups, reducing metadata table overhead
- Large bootstrap tables (> 100K records): Better distribution across more
file groups within max bounds
### Risk Level
low
### Documentation Update
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable