Keith Lee created FLINK-37949: --------------------------------- Summary: FlinkKinesisConsumer on EFO mode runs into Netty deadlock on client close during stop-with-savepoint Key: FLINK-37949 URL: https://issues.apache.org/jira/browse/FLINK-37949 Project: Flink Issue Type: Bug Components: Connectors / Kinesis Affects Versions: 1.20.0 Reporter: Keith Lee
We ran into the following issue * Flink job was continuously failing over and stop-with-savepoint savepoint was failing. * Savepoint was failing because task failed * Task failed because of TimeoutException thrown from netty client when Kinesis connector attempts to close netty client during Stop-with-savepoint operation {quote} java.lang.RuntimeException: java.util.concurrent.TimeoutException at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:177) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.runAndLogError(NettyUtils.java:386) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient.close(NettyNioAsyncHttpClient.java:198) at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxyAsyncV2.close(KinesisProxyAsyncV2.java:72) at org.apache.flink.streaming.connectors.kinesis.internals.publisher.fanout.FanOutRecordPublisherFactory.close(FanOutRecordPublisherFactory.java:101) at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.closeRecordPublisherFactory(KinesisDataFetcher.java:839) at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.shutdownFetcher(KinesisDataFetcher.java:813) at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.cancel(FlinkKinesisConsumer.java:422) at org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:131) at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:311) ... Caused by: java.util.concurrent.TimeoutException at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:172) ...{quote} * Netty client failed to close probably due to deadlock. See exception thrown by checkDeadLock() within netty {quote} org.apache.flink.kinesis.shaded.io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@731d6e17(incomplete) at org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:463) at org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:159) at org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:269) at org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137) at org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30) at org.apache.flink.kinesis.shaded.io.netty.channel.pool.SimpleChannelPool.close(SimpleChannelPool.java:408) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.BetterSimpleChannelPool.close(BetterSimpleChannelPool.java:38) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.HonorCloseOnReleaseChannelPool.close(HonorCloseOnReleaseChannelPool.java:80) at org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.http2.Http2MultiplexedChannelPool.lambda$doClose$11(Http2MultiplexedChannelPool.java:419) ...{quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)