[
https://issues.apache.org/jira/browse/HUDI-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-3657:
-----------------------------
Sprint: Hudi-Sprint-Mar-21, Hudi-Sprint-Mar-22 (was: Hudi-Sprint-Mar-21)
> Unbound the restriction that clean retain commits must be smaller than
> archive minimum commits
> ----------------------------------------------------------------------------------------------
>
> Key: HUDI-3657
> URL: https://issues.apache.org/jira/browse/HUDI-3657
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: Danny Chen
> Priority: Blocker
> Fix For: 0.11.0
>
>
> End-to-end stream processing is increasingly popular among Flink users, and
> in the most typical streaming ingestion scenario the checkpoint interval is
> a matter of minutes (1 min, 5 mins, ...). Say a user sets the interval to
> 1 minute; there are then about 60 write commits on the timeline for each
> hour:
> {t1, t2, t3, t4 ...t60}
> Now consider the very popular streaming read scenario: people want to keep
> history data for a moderate retention period (usually 1 day or even 1 week).
> Say the user configures the number of commits retained by cleaning as:
> _1 (day) * 24 (hours) * 60 (commits per hour)_ = *1440 commits*
> Given the current restriction on the number of commits retained by cleaning:
> _num_retain_commits < min_archive_commits_num_
> we must keep at least 1440 commits on the active timeline, which means at
> least:
> _1440 * 3 = 4320_
> files on the timeline. That puts pressure on file I/O and on metadata
> scanning (the metadata client). If the retained commits do not cover a long
> enough time span, the writer may remove old files and the reader may
> encounter {{FileNotFoundException}}.
> So we need a way to lift the restriction that the number of commits on the
> active timeline must be greater than the number of commits retained by
> cleaning.
> One way I can think of is to remember the last committed cleaning instant
> and only check that when cleaning (suitable for the hours-based cleaning
> strategy). With the num_commits cleaning strategy we may need to scan the
> archived timeline (or the metadata table, if it is enabled?).
> Either way, a solution is urgently needed!
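
The arithmetic in the description can be reproduced with a short sketch. This is only an illustration of the numbers quoted above, not Hudi code: the function names are hypothetical, and the factor of 3 is taken directly from the description (it presumably reflects one timeline file per instant state, which is an assumption here).

```python
# Back-of-the-envelope sizing of the active timeline, using the figures
# from the issue description. All names below are hypothetical.

FILES_PER_COMMIT = 3  # factor used in the description (assumed: one file per instant state)


def commits_to_retain(retention_days, interval_minutes):
    """Number of commits written during the retention window."""
    return retention_days * 24 * 60 // interval_minutes


def min_timeline_files(retained_commits, files_per_commit=FILES_PER_COMMIT):
    """Lower bound on timeline files if every retained commit stays active."""
    return retained_commits * files_per_commit


def satisfies_restriction(num_retain_commits, min_archive_commits):
    """The current rule: retained commits must be fewer than the archive minimum."""
    return num_retain_commits < min_archive_commits


retained = commits_to_retain(retention_days=1, interval_minutes=1)
print(retained)                      # 1440
print(min_timeline_files(retained))  # 4320
```

Under the same assumptions, a one-week retention window at a 1-minute interval would require 10080 retained commits, so the active timeline grows linearly with the retention period as long as the restriction is in place.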
--
This message was sent by Atlassian Jira
(v8.20.1#820001)