pnowojski commented on code in PR #19993:
URL: https://github.com/apache/flink/pull/19993#discussion_r899887457


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriteRequest.java:
##########
@@ -109,6 +112,9 @@ static ChannelStateWriteRequest buildFutureWriteRequest(
                     }
                 },
                 throwable -> {
+                    if (!dataFuture.isDone()) {
+                        return;
+                    }

Review Comment:
   I agree with @zentol that this doesn't look good, and I'm afraid it could lead to resource leaks.
   
   > Why is the dataFuture not being completed in the first place? Isn't that 
the real issue?
   
   It looks to me like the issue is that `dataFuture` is being cancelled from the chain `PipelinedSubpartition#release()` <- ... <- `ResultPartition#release` <- ... <- `NettyShuffleEnvironment#close`. That happens after `StreamTask#cleanUp`, which is itself waiting for this future to complete, so we end up in a deadlock.
   
   We would either need to cancel the future sooner (in `StreamTask#cleanUp`?) or do what @zentol proposed. I think the latter is indeed a good option: we don't need to block while waiting. Let's just not ignore exceptions completely here; logging an error should be fine.
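
   To make the second option concrete, here is a rough, dependency-free sketch of what I have in mind. It is not the actual Flink code: `NonBlockingDiscardSketch`, `FakeBuffer`, `errorLog` and `discard` are hypothetical names, and the list stands in for a real logger. The idea is to register a callback on the future instead of calling `get()`, recycle the buffers if it completes normally, and record an error if it fails or is cancelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not the real ChannelStateWriteRequest code.
final class NonBlockingDiscardSketch {

    /** Stand-in for a network buffer that has to be recycled on discard. */
    static final class FakeBuffer {
        volatile boolean recycled;

        void recycle() {
            recycled = true;
        }
    }

    // Stand-in for a logger (assumption: real code would call LOG.error).
    static final List<String> errorLog =
            Collections.synchronizedList(new ArrayList<>());

    /**
     * Discards the data without blocking the caller. If the future is already
     * done, the callback runs immediately; otherwise it runs whenever the
     * future completes, fails, or is cancelled.
     */
    static void discard(CompletableFuture<List<FakeBuffer>> dataFuture) {
        dataFuture.whenComplete(
                (buffers, failure) -> {
                    if (failure != null) {
                        // Do not swallow the failure silently.
                        errorLog.add("Failed to discard channel state data: " + failure);
                        return;
                    }
                    buffers.forEach(FakeBuffer::recycle);
                });
    }
}
```

   With this shape, a late cancellation from the `NettyShuffleEnvironment#close` path would only produce a log entry, and `StreamTask#cleanUp` would never wait on the future.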
   


