pnowojski commented on code in PR #19993:
URL: https://github.com/apache/flink/pull/19993#discussion_r899887457
##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriteRequest.java:
##########
@@ -109,6 +112,9 @@ static ChannelStateWriteRequest buildFutureWriteRequest(
                     }
                 },
                 throwable -> {
+                    if (!dataFuture.isDone()) {
+                        return;
+                    }

Review Comment:
I agree with @zentol that this doesn't look good, and I'm afraid it could lead to resource leaks.

It looks to me like the issue is that `dataFuture` is cancelled from the chain `PipelinedSubpartition#release()` <- ... <- `ResultPartition#release` <- ... <- `NettyShuffleEnvironment#close`. That happens after `StreamTask#cleanUp` (which is waiting for this future to complete), leading to a deadlock.

We would either need to cancel the future sooner (`StreamTask#cleanUp`?), or do what @zentol proposed. I think the latter is indeed a good option: we don't need to wait in a blocking fashion. Let's just not completely ignore exceptions here; logging an error should be fine.
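
To illustrate the non-blocking option, here is a minimal sketch, not the actual Flink code. `NonBlockingDiscard`, `Recyclable`, and `discardWhenReady` are hypothetical names standing in for the discard callback and its buffers. The sketch assumes the cleanup can be attached with `CompletableFuture#whenComplete` instead of waiting on `get()`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for illustration only; these are not Flink classes.
final class NonBlockingDiscard {
    private static final Logger LOG = Logger.getLogger(NonBlockingDiscard.class.getName());

    /** Hypothetical minimal buffer abstraction (assumption, not Flink's Buffer). */
    interface Recyclable {
        void recycle();
    }

    /**
     * Attaches the cleanup to the future and returns immediately,
     * whether or not the future has completed yet.
     */
    static void discardWhenReady(CompletableFuture<List<Recyclable>> dataFuture) {
        dataFuture.whenComplete(
                (buffers, failure) -> {
                    if (failure != null) {
                        // Also reached on cancellation (CancellationException):
                        // log the problem instead of silently ignoring it.
                        LOG.log(Level.SEVERE, "Data future did not complete normally", failure);
                        return;
                    }
                    if (buffers != null) {
                        // Normal completion: release every buffer exactly once.
                        buffers.forEach(Recyclable::recycle);
                    }
                });
    }
}
```

With this shape, a future that is cancelled later (for example during `NettyShuffleEnvironment#close`) would only produce a log entry, and nothing blocks waiting for it.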