junrao commented on code in PR #19437:
URL: https://github.com/apache/kafka/pull/19437#discussion_r2049498194


##########
core/src/main/java/kafka/server/share/DelayedShareFetch.java:
##########
@@ -277,9 +323,15 @@ public boolean tryComplete() {
             return false;
         } catch (Exception e) {
             log.error("Error processing delayed share fetch request", e);
-            releasePartitionLocks(topicPartitionData.keySet());
-            partitionsAcquired.clear();
-            partitionsAlreadyFetched.clear();
+            // In case we have a remote fetch exception, we have already released locks for partitions which have potential
+            // local log read. We do not release locks for partitions which have a remote storage read because we need to

Review Comment:
   @adixitconfluent: I am saying that the code could end up never releasing the 
share partition lock. This is a bit subtle, but here is a possible scenario.
   
   Thread 1 calls `DelayedOperationPurgatory.checkAndComplete()`, which 
eventually will call the following.
   
   ```
       boolean safeTryComplete() {
           lock.lock();
           try {
               if (isCompleted()) return false;
               else return tryComplete();
           } finally {
               lock.unlock();
           }
       }
   ```
   Suppose that `isCompleted()` returns false and thread 1 is just about to 
call `tryComplete()`.
   
   Now, the expiration thread kicks in and calls `run()`, shown below. Since 
`completed` is still false, `forceComplete()` sets it to true and runs through 
`onComplete()`.
   
   ```
       public void run() {
           if (forceComplete())
               onExpiration();
       }
   
       public boolean forceComplete() {
           if (completed.compareAndSet(false, true)) {
               // cancel the timeout timer
               cancel();
               onComplete();
               return true;
           } else {
               return false;
           }
       }
   ```
   
   Now thread 1 continues in `tryComplete()`. It acquires the share partition 
lock for a remote partition and sets `remoteStorageFetchException`. It then 
calls `forceComplete()`. Since `completed` is already true, `forceComplete()` 
returns false immediately without calling `onComplete()`. So, the acquired 
share partition lock will never be released. The same problem exists for local 
fetch in `tryComplete()`, and we have the following code there to handle the 
lock release.
   
   ```
                       boolean completedByMe = forceComplete();
                       // If invocation of forceComplete is not successful, then that means the request is already completed
                       // hence release the acquired locks.
                       if (!completedByMe) {
                           releasePartitionLocks(partitionsAcquired.keySet());
                       }
   ```
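
To make the interleaving concrete, here is a minimal single-threaded replay of the scenario above. It is only a sketch, not the actual `DelayedShareFetch` code: `LockReleaseSketch`, `SketchOperation`, `partitionLockHeld`, `tryCompleteAfterCheck()` and `lockLeakedAfterRace()` are hypothetical names, and the share partition lock is modelled as a plain boolean flag. The three steps are replayed in order on one thread so the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names exist in Kafka.
class LockReleaseSketch {

    /** Stand-in for a delayed operation that guards one share partition lock. */
    static class SketchOperation {
        private final AtomicBoolean completed = new AtomicBoolean(false);
        // Assumption: the share partition lock is modelled as a flag; true means held.
        final AtomicBoolean partitionLockHeld = new AtomicBoolean(false);
        private final boolean releaseWhenNotCompletedByMe;

        SketchOperation(boolean releaseWhenNotCompletedByMe) {
            this.releaseWhenNotCompletedByMe = releaseWhenNotCompletedByMe;
        }

        // Same shape as the forceComplete() quoted above: only the caller that
        // flips completed from false to true gets to run onComplete().
        boolean forceComplete() {
            if (completed.compareAndSet(false, true)) {
                onComplete();
                return true;
            }
            return false;
        }

        // Releases whatever lock is held at the moment it runs.
        void onComplete() {
            partitionLockHeld.set(false);
        }

        // Models the part of tryComplete() that runs after isCompleted() returned false.
        boolean tryCompleteAfterCheck() {
            partitionLockHeld.set(true); // acquire the share partition lock
            boolean completedByMe = forceComplete();
            // The guard used for local fetch: release if someone else completed first.
            if (!completedByMe && releaseWhenNotCompletedByMe) {
                partitionLockHeld.set(false);
            }
            return completedByMe;
        }
    }

    /** Replays the race and reports whether the lock is still held at the end. */
    static boolean lockLeakedAfterRace(boolean releaseWhenNotCompletedByMe) {
        SketchOperation op = new SketchOperation(releaseWhenNotCompletedByMe);
        // Step 1 (thread 1): isCompleted() was observed as false; nothing to do here.
        // Step 2 (expiration thread): run() -> forceComplete() -> onComplete().
        op.forceComplete();
        // Step 3 (thread 1): continues into tryComplete() and acquires the lock.
        op.tryCompleteAfterCheck();
        return op.partitionLockHeld.get();
    }

    public static void main(String[] args) {
        System.out.println("leak without guard: " + lockLeakedAfterRace(false)); // true
        System.out.println("leak with guard: " + lockLeakedAfterRace(true));     // false
    }
}
```

Without the `completedByMe` guard the flag stays set after step 3, which corresponds to the share partition lock never being released; with the guard it is cleared.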


