[ https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-32583: ---------------------------------- Release Note: Fixes a bug in the RestClient where making a request after the client was closed returns a future that never completes. (was: Fixes a race condition in the RestClient that could happen when submitting a request while closing the client. There was a small chance that the request submission wouldn't complete.) > RestClient can deadlock if request made after Netty event executor terminated > ----------------------------------------------------------------------------- > > Key: FLINK-32583 > URL: https://issues.apache.org/jira/browse/FLINK-32583 > Project: Flink > Issue Type: Bug > Components: Runtime / REST > Affects Versions: 1.18.0, 1.16.3, 1.17.2 > Reporter: Patrick Lucas > Assignee: Patrick Lucas > Priority: Minor > Labels: pull-request-available > Fix For: 1.18.0, 1.16.3, 1.17.2 > > > The RestClient can deadlock if a request is made after the Netty event > executor has terminated. > This is due to the listener that would resolve the CompletableFuture that is > attached to the ChannelFuture returned by the call to Netty to connect not > being able to run because the executor to run it rejects the execution. > [RestClient.java|https://github.com/apache/flink/blob/release-1.17.1/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L471-L482]: > {code:java} > final ChannelFuture connectFuture = bootstrap.connect(targetAddress, > targetPort); > final CompletableFuture<Channel> channelFuture = new CompletableFuture<>(); > connectFuture.addListener( > (ChannelFuture future) -> { > if (future.isSuccess()) { > channelFuture.complete(future.channel()); > } else { > channelFuture.completeExceptionally(future.cause()); > } > }); > {code} > In this code, the call to {{addListener()}} can fail silently (only logging > to the console), meaning any code waiting on the CompletableFuture returned > by this method will deadlock. > There was some work in Netty around this back in 2015, but it's unclear to me > how this situation is expected to be handled given the discussion and changes > from these issues: > * [https://github.com/netty/netty/issues/3449] (open) > * [https://github.com/netty/netty/pull/3483] (closed) > * [https://github.com/netty/netty/pull/3566] (closed) > * [https://github.com/netty/netty/pull/5087] (merged) > I think a reasonable fix for Flink would be to check the state of > {{connectFuture}} and {{channelFuture}} immediately after the call to > {{addListener()}}, resolving {{channelFuture}} with > {{completeExceptionally()}} if {{connectFuture}} is done and failed and > {{channelFuture}} has not been completed. In the possible race condition > where the listener was attached successfully and the connection fails > instantly, the result is the same, as calls to > {{CompletableFuture#completeExceptionally()}} are idempotent. > A workaround for users of RestClient is to call {{CompletableFuture#get(long > timeout, TimeUnit unit)}} rather than {{#get()}} or {{#join()}} on the > CompletableFutures it returns. However, if the call throws TimeoutException, > the cause of the failure cannot easily be determined. -- This message was sent by Atlassian Jira (v8.20.10#820010)