Antonio Vespoli created FLINK-32230:
---------------------------------------

             Summary: Deadlock in AWS Kinesis Data Streams connector
                 Key: FLINK-32230
                 URL: https://issues.apache.org/jira/browse/FLINK-32230
             Project: Flink
          Issue Type: Bug
          Components: Connectors / AWS
    Affects Versions: 1.17.1, 1.16.2, 1.15.4
            Reporter: Antonio Vespoli
             Fix For: aws-connector-3.1.0, aws-connector-4.2.0


Connector calls to AWS Kinesis Data Streams can hang indefinitely without 
making any progress.

We suspect the root cause to be related to the SDK handling of exceptions, 
similarly to what observed in FLINK-31675.

We identified this deadlock on applications running on AWS Kinesis Data 
Analytics using the AWS Kinesis Data Streams connectors (with AWS SDK version 
2.20.32 as per FLINK-31675). The deadlock scenario is still the same as 
described in FLINK-31675. However, the Netty content-length exception does not 
appear when using the updated SDK version.

This issue only occurs for applications and streams in the AWS regions 
_ap-northeast-3_ and {_}us-gov-east-1{_}. We did not observe this issue in any 
other AWS region.

The issue happens sporadically and unpredictably. As per its nature, we do not 
have instructions for reproducing it.

Example of flame-graphs observed when the issue occurs:


{code:java}
root
java.lang.Thread.run:829
org.apache.flink.runtime.taskmanager.Task.run:568
org.apache.flink.runtime.taskmanager.Task.doRun:746
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
java.util.concurrent.locks.LockSupport.parkNanos:234
jdk.internal.misc.Unsafe.park:-2 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to