nandakumar131 commented on code in PR #8637:
URL: https://github.com/apache/ozone/pull/8637#discussion_r2176572567


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java:
##########
@@ -905,7 +905,37 @@ public void snapshotLimitCheck() throws IOException, OMException {
   }
 
   public void decrementInFlightSnapshotCount() {
-    inFlightSnapshotCount.decrementAndGet();
+    /*
+     * There is a race condition here because of which
+     * inFlightSnapshotCount could become negative sometimes.
+     *
+     * When we get a call back from Ratis on notifyLeaderReady,
+     * we assume that all the pending transactions (from RaftLog)
+     * are applied to the StateMachine and we go ahead and reset the
+     * inFlightSnapshotCount to 0. The expectation here is that after
+     * the leader election we want to start inFlightSnapshotCount from 0.
+     *
+     * The applyTransaction in OzoneManagerStateMachine processes
+     * the calls in an async manner and returns a CompletableFuture,
+     * the transactions are not yet fully processed by the
+     * OzoneManagerStateMachine when Ratis notifies leader ready. (RATIS-2313)
+     *
+     * Because of the async processing in OzoneManagerStateMachine and
+     * Ratis not waiting for the CompletableFuture to complete, we mark the
+     * leader as ready (reset inFlightSnapshotCount to 0) even before we complete
+     * processing all the pending transactions from the old term. If there is a
+     * create snapshot transaction in the Raft Log from old term and it gets
+     * processed after we reset inFlightSnapshotCount, the count would become -1
+     * as we decrement the inFlightSnapshotCount when the OMSnapshotCreateRequest
+     * processing is completed.
+     *
+     * The workaround here is to make sure that we don't make inFlightSnapshotCount
+     * negative. We should be able to remove the workaround after RATIS-2313.
+     */
+    int result = inFlightSnapshotCount.decrementAndGet();
+    if (result < 0) {
+      resetInFlightSnapshotCount();
+    }

Review Comment:
   Created [HDDS-13357](https://issues.apache.org/jira/browse/HDDS-13357) to fix `resetInFlightSnapshotCount` logic.
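
For reference while that follow-up is open, here is a minimal standalone sketch of the scenario described in the code comment. It is not the planned HDDS-13357 change and not Ozone code: the class `InFlightCounterSketch` and its method names are hypothetical, and `reset()` is assumed to simply set the counter to 0, as the comment describes. It contrasts the decrement-then-reset shape from the diff with an alternative that clamps at zero in a single atomic update, so the counter is never observable as negative.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the in-flight snapshot counter; not Ozone code.
class InFlightCounterSketch {
  private final AtomicInteger inFlight = new AtomicInteger();

  void increment() {
    inFlight.incrementAndGet();
  }

  // Assumed to mirror the reset to 0 performed when the leader becomes ready.
  void reset() {
    inFlight.set(0);
  }

  // Same shape as the workaround in the diff: decrement, then reset if negative.
  // Between the two steps another thread can still read a negative value.
  int decrementThenReset() {
    int result = inFlight.decrementAndGet();
    if (result < 0) {
      reset();
      return 0;
    }
    return result;
  }

  // Alternative: decrement and clamp at zero in one atomic read-modify-write.
  int decrementClamped() {
    return inFlight.updateAndGet(c -> Math.max(0, c - 1));
  }

  int get() {
    return inFlight.get();
  }

  public static void main(String[] args) {
    InFlightCounterSketch counter = new InFlightCounterSketch();
    counter.increment();   // create-snapshot request accepted in the old term
    counter.reset();       // leader marked ready before the request completes
    // The late completion from the old term now decrements a counter that is already 0.
    System.out.println("after late decrement: " + counter.decrementClamped());
  }
}
```

Either variant only hides the symptom; the underlying ordering problem is still the one tracked in RATIS-2313.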



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
