[ https://issues.apache.org/jira/browse/FLINK-32230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776551#comment-17776551 ]
Danny Cranmer commented on FLINK-32230: --------------------------------------- Merged commit [{{37975ca}}|https://github.com/apache/flink-connector-aws/commit/37975ca22b088a61ccdc2ed6095e0e1455b5744d] into apache:v3.0 > Deadlock in AWS Kinesis Data Streams AsyncSink connector > -------------------------------------------------------- > > Key: FLINK-32230 > URL: https://issues.apache.org/jira/browse/FLINK-32230 > Project: Flink > Issue Type: Bug > Components: Connectors / AWS > Affects Versions: 1.15.4, 1.16.2, 1.17.1 > Reporter: Antonio Vespoli > Assignee: Antonio Vespoli > Priority: Major > Labels: pull-request-available > Fix For: aws-connector-3.1.0, aws-connector-4.2.0 > > > Connector calls to AWS Kinesis Data Streams can hang indefinitely without > making any progress. > We suspect the root cause to be related to the SDK handling of exceptions, > similarly to what observed in FLINK-31675. > We identified this deadlock on applications running on AWS Kinesis Data > Analytics using the AWS Kinesis Data Streams AsyncSink (with AWS SDK version > 2.20.32 as per FLINK-31675). The deadlock scenario is still the same as > described in FLINK-31675. However, the Netty content-length exception does > not appear when using the updated SDK version. > This issue only occurs for applications and streams in the AWS regions > _ap-northeast-3_ and {_}us-gov-east-1{_}. We did not observe this issue in > any other AWS region. > The issue happens sporadically and unpredictably. As per its nature, we do > not have instructions for reproducing it. > Example of flame-graphs observed when the issue occurs: > {code:java} > root > java.lang.Thread.run:829 > org.apache.flink.runtime.taskmanager.Task.run:568 > org.apache.flink.runtime.taskmanager.Task.doRun:746 > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932 > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953 > org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1 > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753 > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804 > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203 > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1 > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519 > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65 > org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110 > org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159 > org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234 > org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66 > org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64 > org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287 > org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147 > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198 > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241 > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50 > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1 > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253 > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300 > org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89 > org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165 > org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494 > org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503 > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84 > org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149 > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211 > java.util.concurrent.locks.LockSupport.parkNanos:234 > jdk.internal.misc.Unsafe.park:-2 {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)