pratyakshsharma commented on a change in pull request #3646: URL: https://github.com/apache/hudi/pull/3646#discussion_r801435126
########## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java ########## @@ -299,14 +308,24 @@ public CleanPlanner(HoodieEngineContext context, HoodieTable<T, I, K, O> hoodieT // do not clean up a savepoint data file continue; } - // Dont delete the latest commit and also the last commit before the earliest commit we - // are retaining - // The window of commit retain == max query run time. So a query could be running which - // still - // uses this file. - if (fileCommitTime.equals(lastVersion) || (fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain))) { - // move on to the next file - continue; + + if (policy == HoodieCleaningPolicy.KEEP_LATEST_COMMITS) { + // Dont delete the latest commit and also the last commit before the earliest commit we + // are retaining + // The window of commit retain == max query run time. So a query could be running which + // still + // uses this file. + if (fileCommitTime.equals(lastVersion) || (fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain))) { + // move on to the next file + continue; + } + } else if (policy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) { + // This block corresponds to KEEP_LATEST_BY_HOURS policy + // Do not delete the latest commit. + if (fileCommitTime.equals(lastVersion)) { Review comment: The cleaner policies are designed to enable the longest running query to succeed. Let us take a small example where ingestion is happening every 30 minutes and `hoodie.cleaner.commits.retained = 2`. This implies the longest running query is going to take 1 hour to complete. Now if I maintain a list of commits, when commit at index=2 in this list (i.e 3rd commit from the beginning) is completed, the query which started when commit at index=0 completed might still be running. list of commits -> 0. 1. 2 time (in mins). -> 0. 30. 60 Hence to allow this query to finish, we take into account `lastVersionBeforeEarliestCommitToRetain` in L318. `lastVersion` in this case is going to be commit at index=1. With the new policy KEEP_LATEST_BY_HOURS, when 3rd commit is completed, 1st commit will be retained automatically by the logic. Hence the code in its current form works fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org