Keith Lee created FLINK-37949:
---------------------------------

             Summary: FlinkKinesisConsumer on EFO mode runs into Netty deadlock 
on client close during stop-with-savepoint 
                 Key: FLINK-37949
                 URL: https://issues.apache.org/jira/browse/FLINK-37949
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: 1.20.0
            Reporter: Keith Lee


We ran into the following issue

* Flink job was continuously failing over and stop-with-savepoint savepoint was 
failing.
* Savepoint was failing because task failed
* Task failed because of TimeoutException thrown from netty client when Kinesis 
connector attempts to close netty client during Stop-with-savepoint operation
{quote}    java.lang.RuntimeException: java.util.concurrent.TimeoutException
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:177)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.runAndLogError(NettyUtils.java:386)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient.close(NettyNioAsyncHttpClient.java:198)
         at 
org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxyAsyncV2.close(KinesisProxyAsyncV2.java:72)
         at 
org.apache.flink.streaming.connectors.kinesis.internals.publisher.fanout.FanOutRecordPublisherFactory.close(FanOutRecordPublisherFactory.java:101)
         at 
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.closeRecordPublisherFactory(KinesisDataFetcher.java:839)
         at 
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.shutdownFetcher(KinesisDataFetcher.java:813)
         at 
org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.cancel(FlinkKinesisConsumer.java:422)
         at 
org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:131)
         at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:311)
    ...
    Caused by: java.util.concurrent.TimeoutException
         at 
java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
         at 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:172)
    ...{quote}
*  Netty client failed to close probably due to deadlock. See exception thrown 
by checkDeadLock() within netty
{quote}    
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.BlockingOperationException:
 DefaultChannelPromise@731d6e17(incomplete)
         at 
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:463)
         at 
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:159)
         at 
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:269)
         at 
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
         at 
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
         at 
org.apache.flink.kinesis.shaded.io.netty.channel.pool.SimpleChannelPool.close(SimpleChannelPool.java:408)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.BetterSimpleChannelPool.close(BetterSimpleChannelPool.java:38)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.HonorCloseOnReleaseChannelPool.close(HonorCloseOnReleaseChannelPool.java:80)
         at 
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.http2.Http2MultiplexedChannelPool.lambda$doClose$11(Http2MultiplexedChannelPool.java:419)
    ...{quote}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to