Antonio Vespoli created FLINK-32230: ---------------------------------------
Summary: Deadlock in AWS Kinesis Data Streams connector Key: FLINK-32230 URL: https://issues.apache.org/jira/browse/FLINK-32230 Project: Flink Issue Type: Bug Components: Connectors / AWS Affects Versions: 1.17.1, 1.16.2, 1.15.4 Reporter: Antonio Vespoli Fix For: aws-connector-3.1.0, aws-connector-4.2.0 Connector calls to AWS Kinesis Data Streams can hang indefinitely without making any progress. We suspect the root cause to be related to the SDK handling of exceptions, similarly to what observed in FLINK-31675. We identified this deadlock on applications running on AWS Kinesis Data Analytics using the AWS Kinesis Data Streams connectors (with AWS SDK version 2.20.32 as per FLINK-31675). The deadlock scenario is still the same as described in FLINK-31675. However, the Netty content-length exception does not appear when using the updated SDK version. This issue only occurs for applications and streams in the AWS regions _ap-northeast-3_ and {_}us-gov-east-1{_}. We did not observe this issue in any other AWS region. The issue happens sporadically and unpredictably. As per its nature, we do not have instructions for reproducing it. Example of flame-graphs observed when the issue occurs: {code:java} root java.lang.Thread.run:829 org.apache.flink.runtime.taskmanager.Task.run:568 org.apache.flink.runtime.taskmanager.Task.doRun:746 org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932 org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953 org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1 org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753 org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804 org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203 org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1 org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519 org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65 org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110 org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159 org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234 org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66 org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64 org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287 org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147 org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198 org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241 org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50 org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1 org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253 org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300 org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89 org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165 org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494 org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503 org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84 org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211 java.util.concurrent.locks.LockSupport.parkNanos:234 jdk.internal.misc.Unsafe.park:-2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)