nsivabalan commented on PR #6498: URL: https://github.com/apache/hudi/pull/6498#issuecomment-1236228935
Got it, thanks. Let me go through the example you have put up and clarify a few things.

Say we have 3 committed files in partition-A, we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

```
partition-A/
  commit-0 added file1_V1.parquet
  commit-1 added file1_V2.parquet
  commit-2 added file1_V3.parquet
partition-B/
  commit-3 added file2_V1.parquet
```

With KEEP_LATEST_COMMITS and CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files created by commit-0 and keep 3 commits, i.e. file1_V1.parquet will be cleaned up. But Hudi also keeps track of the `earliest commit retained`, which in this case is commit-2. This `earliest commit retained` is what we will leverage later to do incremental cleaning.

For the next cleaning, incremental cleaning will kick in and comb through all commits >= `earliest commit retained`, i.e. commit-2, and so file2_V1 will be deleted this time. It will then update the `earliest commit retained` to commit-3.

This may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we can't pinpoint a commit and say everything before that commit can be ignored for future cleaning. Thus, in the case of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is with this policy that we might encounter a corner case where a file group was updated only in commit 1 and one other early commit, was then left untouched for a long time, and finally got a new version in, say, commit 100. At that point we need to clean up the first version (assuming the KEEP_LATEST_FILE_VERSIONS count is 2).

Let me know if this makes sense. If you still feel I am missing something, can you elaborate with an example?
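
To make the difference between the two policies concrete, here is a toy Python model of the reasoning above. It is only a sketch and not Hudi's cleaner code: the function names `partitions_touched_since` and `stale_versions`, the tuple-based timeline, and the use of commit 2 as the second early commit in the corner case are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy timeline entries: (commit_number, partition, file_group).
# First example from this comment: commits 0-2 rewrite file1, commit 3 adds file2.
EXAMPLE = [
    (0, "partition-A", "file1"),
    (1, "partition-A", "file1"),
    (2, "partition-A", "file1"),
    (3, "partition-B", "file2"),
]


def partitions_touched_since(timeline, earliest_commit_retained):
    """KEEP_LATEST_COMMITS style (hypothetical helper): only partitions written
    at or after the watermark need to be listed by the next incremental clean."""
    return {p for commit, p, _ in timeline if commit >= earliest_commit_retained}


def stale_versions(timeline, versions_retained):
    """KEEP_LATEST_FILE_VERSIONS style (hypothetical helper): per file group,
    every version except the newest `versions_retained` ones is stale,
    no matter how old the commit that wrote it is."""
    by_group = defaultdict(list)
    for commit, partition, file_group in timeline:
        by_group[(partition, file_group)].append(commit)
    stale = {}
    for key, commits in by_group.items():
        commits.sort()
        old = commits[: max(len(commits) - versions_retained, 0)]
        if old:
            stale[key] = old
    return stale


# Corner case: commit 2 is an assumed second early commit, then a long gap.
CORNER = [(1, "partition-A", "file1"), (2, "partition-A", "file1")]
before = stale_versions(CORNER, versions_retained=2)
after = stale_versions(CORNER + [(100, "partition-A", "file1")], versions_retained=2)

print(sorted(partitions_touched_since(EXAMPLE, 2)))  # ['partition-A', 'partition-B']
print(sorted(partitions_touched_since(EXAMPLE, 3)))  # ['partition-B']
print(before)  # {}
print(after)   # {('partition-A', 'file1'): [1]}
```

In this model the version that becomes stale at commit 100 was written back at commit 1, so no watermark later than commit 1 could have safely excluded it, whereas with KEEP_LATEST_COMMITS the watermark alone decides which commits still need to be looked at.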
