[ https://issues.apache.org/jira/browse/FLINK-32230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Cranmer reassigned FLINK-32230:
-------------------------------------

    Assignee: Antonio Vespoli  (was: Ahmed Hamdy)

> Deadlock in AWS Kinesis Data Streams AsyncSink connector
> --------------------------------------------------------
>
>                 Key: FLINK-32230
>                 URL: https://issues.apache.org/jira/browse/FLINK-32230
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / AWS
>    Affects Versions: 1.15.4, 1.16.2, 1.17.1
>            Reporter: Antonio Vespoli
>            Assignee: Antonio Vespoli
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: aws-connector-3.1.0, aws-connector-4.2.0
>
>
> Connector calls to AWS Kinesis Data Streams can hang indefinitely without 
> making any progress.
> We suspect the root cause is related to the SDK's handling of exceptions, 
> similar to what was observed in FLINK-31675.
> We identified this deadlock on applications running on AWS Kinesis Data 
> Analytics using the AWS Kinesis Data Streams AsyncSink (with AWS SDK version 
> 2.20.32, as per FLINK-31675). The deadlock scenario is the same as the one 
> described in FLINK-31675; however, the Netty content-length exception does 
> not appear when using the updated SDK version.
> This issue only occurs for applications and streams in the AWS regions 
> _ap-northeast-3_ and _us-gov-east-1_. We did not observe this issue in any 
> other AWS region.
> The issue happens sporadically and unpredictably. Given its nature, we do 
> not have steps to reproduce it.
> Example of a flame graph observed when the issue occurs:
> {code:java}
> root
> java.lang.Thread.run:829
> org.apache.flink.runtime.taskmanager.Task.run:568
> org.apache.flink.runtime.taskmanager.Task.doRun:746
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
> org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
> org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
> org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
> org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
> org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
> org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
> java.util.concurrent.locks.LockSupport.parkNanos:234
> jdk.internal.misc.Unsafe.park:-2
> {code}
>  
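> A minimal, self-contained sketch of the blocking pattern suggested by the 
> flame graph (this is not the actual Flink or AWS SDK source; the class and 
> the helper callKinesisAsync are made up for illustration): flush() parks the 
> task thread in a yield loop until the asynchronous client completes the 
> in-flight request and enqueues a mail, so a request future that is never 
> completed blocks the thread, and the checkpoint barrier, indefinitely.
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.LinkedBlockingQueue;
> 
> public class AsyncSinkDeadlockSketch {
>     // Stand-in for the task mailbox: flush() blocks here until a mail arrives.
>     private final LinkedBlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
>     private int inFlightRequestsCount = 0;
> 
>     void submitRequestEntries() {
>         inFlightRequestsCount++;
>         // Hypothetical asynchronous client call standing in for the sink's
>         // putRecords request to Kinesis Data Streams.
>         CompletableFuture<Void> sdkCall = callKinesisAsync();
>         // On completion, a mail is enqueued that decrements the in-flight count.
>         sdkCall.whenComplete((result, error) ->
>                 mailbox.add(() -> inFlightRequestsCount--));
>     }
> 
>     void flush() throws InterruptedException {
>         // prepareSnapshotPreBarrier -> flush: wait until nothing is in flight.
>         while (inFlightRequestsCount > 0) {
>             // Equivalent of MailboxExecutorImpl.yield() -> TaskMailboxImpl.take()
>             // in the flame graph: the task thread parks here. If the SDK never
>             // completes the future (e.g. a swallowed exception, cf. FLINK-31675),
>             // no mail ever arrives and this loop never exits.
>             mailbox.take().run();
>         }
>     }
> 
>     private CompletableFuture<Void> callKinesisAsync() {
>         // Never completed: models the suspected SDK behaviour under exception.
>         return new CompletableFuture<>();
>     }
> }
> {code}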



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
