pratyakshsharma commented on a change in pull request #3646:
URL: https://github.com/apache/hudi/pull/3646#discussion_r801435126



##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -299,14 +308,24 @@ public CleanPlanner(HoodieEngineContext context, 
HoodieTable<T, I, K, O> hoodieT
             // do not clean up a savepoint data file
             continue;
           }
-          // Dont delete the latest commit and also the last commit before the 
earliest commit we
-          // are retaining
-          // The window of commit retain == max query run time. So a query 
could be running which
-          // still
-          // uses this file.
-          if (fileCommitTime.equals(lastVersion) || 
(fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain))) {
-            // move on to the next file
-            continue;
+
+          if (policy == HoodieCleaningPolicy.KEEP_LATEST_COMMITS) {
+            // Dont delete the latest commit and also the last commit before 
the earliest commit we
+            // are retaining
+            // The window of commit retain == max query run time. So a query 
could be running which
+            // still
+            // uses this file.
+            if (fileCommitTime.equals(lastVersion) || 
(fileCommitTime.equals(lastVersionBeforeEarliestCommitToRetain))) {
+              // move on to the next file
+              continue;
+            }
+          } else if (policy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
+            // This block corresponds to KEEP_LATEST_BY_HOURS policy
+            // Do not delete the latest commit.
+            if (fileCommitTime.equals(lastVersion)) {

Review comment:
       The cleaner policies are designed to enable the longest running query to 
succeed. Let us take a small example where ingestion is happening every 30 
minutes and `hoodie.cleaner.commits.retained = 2`. This implies the longest 
running query is going to take 1 hour to complete. Now if I maintain a list of 
commits, when commit at index=2 in this list (i.e 3rd commit from the 
beginning) is completed, the query which started when commit at index=0 
completed might still be running.
   
   list of commits ->  0.    1.       2
   time (in mins).  ->  0.    30.    60
   
   Hence to allow this query to finish, we take into account 
`lastVersionBeforeEarliestCommitToRetain` in L318. `lastVersion` in this case 
is going to be commit at index=1. 
   
   With the new policy KEEP_LATEST_BY_HOURS, when 3rd commit is completed, 1st 
commit will be retained automatically by the logic. Hence the code in its 
current form works fine. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to