kbuci opened a new pull request, #18306: URL: https://github.com/apache/hudi/pull/18306
### Describe the issue this Pull Request addresses

Currently, log compaction is always scheduled whenever the operation type is `LOG_COMPACT`, regardless of how many delta commits have occurred since the last log compaction. This leads to unnecessary log compaction scheduling, wasting resources when only a few delta commits (and therefore most likely only a few log files/blocks) have accumulated.

### Summary and Changelog

Changes log compaction scheduling to use the `LogCompactionBlocksThreshold` config as a gating threshold. Instead of unconditionally scheduling log compaction, the scheduler now counts the delta commits since the last compaction and since the last log compaction, takes the minimum of the two, and only schedules log compaction when that count meets or exceeds the threshold.

- Added `CompactionUtils.getDeltaCommitsSinceLatestLogCompaction()`, which determines the number of delta commits since the most recent completed log compaction by inspecting the raw active timeline (needed because completed log compaction instants transition from `LOG_COMPACTION_ACTION` to `DELTA_COMMIT_ACTION`)
- Added `ScheduleCompactionActionExecutor.getDeltaCommitInfoSinceLogCompaction()`, which creates a raw active timeline and delegates to the new `CompactionUtils` method
- Renamed `getLatestDeltaCommitInfo()` to `getLatestDeltaCommitInfoSinceCompaction()` for clarity
- Updated `needCompact()` to replace the unconditional `return true` for `LOG_COMPACT` with threshold-based logic: `Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction) >= logCompactionBlocksThreshold`
- Added unit tests for `getDeltaCommitsSinceLatestLogCompaction` covering the completed-log-compaction, no-log-compaction, and empty-timeline cases

### Impact

No public API changes. Log compaction will now be scheduled less frequently, only when enough delta commits have accumulated since both the last compaction and the last log compaction to meet `hoodie.log.compaction.blocks.threshold` (default: 5).
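A minimal sketch of the gating logic described above, with the default threshold of 5. The class and method names here are illustrative only, not Hudi's actual internals; the PR places this check inside `needCompact()`:

```java
// Illustrative sketch of the threshold-based gating described in the changelog.
// Class and method names are hypothetical; in the PR this logic lives in
// ScheduleCompactionActionExecutor.needCompact() for the LOG_COMPACT case.
public class LogCompactionGate {

  // Schedule log compaction only when enough delta commits have accumulated
  // since BOTH the last compaction and the last log compaction.
  static boolean shouldScheduleLogCompaction(int deltaCommitsSinceCompaction,
                                             int deltaCommitsSinceLogCompaction,
                                             int logCompactionBlocksThreshold) {
    return Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction)
        >= logCompactionBlocksThreshold;
  }

  public static void main(String[] args) {
    // 7 delta commits since compaction, but only 3 since the last log
    // compaction: the min (3) is below the threshold, so skip scheduling.
    System.out.println(shouldScheduleLogCompaction(7, 3, 5)); // prints false
    // Both counts meet the threshold, so log compaction is scheduled.
    System.out.println(shouldScheduleLogCompaction(7, 6, 5)); // prints true
  }
}
```

Taking the minimum is what prevents the previous behavior: a table that just completed a log compaction has a low "since log compaction" count even if many delta commits exist since the last regular compaction, so scheduling is deferred until new delta commits accumulate.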
This reduces unnecessary log compaction overhead for tables with frequent small writes.

### Risk Level

Low. The change only affects log compaction scheduling frequency; regular compaction scheduling is unchanged.

### Documentation Update

None. No new configs are introduced; the existing `hoodie.log.compaction.blocks.threshold` config now also gates scheduling frequency, in addition to its existing role in plan generation.

### Contributor's checklist

- [x] Read through the [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
