seth saperstein created FLINK-29099:
---------------------------------------

             Summary: Deadlock for Single Subtask in Kinesis Consumer
                 Key: FLINK-29099
                 URL: https://issues.apache.org/jira/browse/FLINK-29099
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: 1.15.2, 1.14.5, 1.13.6, 1.12.7, 1.11.6, 1.10.3, 1.9.3
            Reporter: seth saperstein


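The deadlock is reached when both of the following hold for a subtask:
 * the local watermark has reached the max lookahead relative to the global 
watermark
 * the subtask is marked idle

The lookahead prevents the RecordEmitter from emitting a new record. The idle 
state prevents the global watermark from being updated, so neither condition 
can clear on its own.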

To exit this deadlock state, we need to complete the [TODO 
here|https://github.com/apache/flink/blob/221d70d9930f72147422ea24b399f006ebbfb8d7/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/internals/KinesisDataFetcher.java#L1268]
 so that the global watermark is updated even while the subtask is marked idle. 
Once it is updated, the max lookahead is no longer reached and the subtask can 
emit records again.
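
As a rough illustration of the two conditions and of the proposed fix, here is 
a minimal sketch. It is a simplified model and not the connector's code: the 
class, the method names, the refreshWhileIdle flag and all timestamps are 
hypothetical, and it assumes the rest of the job has already advanced the 
global watermark while this subtask was idle.

```java
// Simplified model of the deadlock. All names and values are hypothetical
// and do not mirror the actual KinesisDataFetcher implementation.
public class LookaheadDeadlockSketch {

    /** A record may be emitted only if it is within the lookahead of the global watermark. */
    static boolean canEmit(long nextRecordTimestamp, long globalWatermark, long lookaheadMillis) {
        return nextRecordTimestamp <= globalWatermark + lookaheadMillis;
    }

    /**
     * Global watermark used by the subtask after a sync round.
     * With refreshWhileIdle == false (the behavior described in this issue),
     * an idle subtask keeps its stale value.
     */
    static long syncGlobalWatermark(
            long staleGlobal, long freshGlobal, boolean idle, boolean refreshWhileIdle) {
        if (idle && !refreshWhileIdle) {
            return staleGlobal;
        }
        return freshGlobal;
    }

    public static void main(String[] args) {
        long lookahead = 60_000L;       // assumed max lookahead
        long staleGlobal = 1_000_000L;  // global watermark when the subtask went idle
        long freshGlobal = 1_500_000L;  // global watermark after other subtasks advanced
        long nextRecord = 1_100_000L;   // timestamp of the next queued record

        // Current behavior: idle subtask keeps the stale value and stays blocked.
        long current = syncGlobalWatermark(staleGlobal, freshGlobal, true, false);
        System.out.println("current behavior, can emit: "
                + canEmit(nextRecord, current, lookahead));

        // Proposed behavior: the value is refreshed even while idle.
        long fixed = syncGlobalWatermark(staleGlobal, freshGlobal, true, true);
        System.out.println("with idle refresh, can emit: "
                + canEmit(nextRecord, fixed, lookahead));
    }
}
```

In this model the first check stays false for as long as the subtask is idle, 
which is the deadlock; refreshing the value while idle is enough to unblock the 
emitter.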

 

*Context:*

We reached this scenario at Lyft as a result of prolonged CPU throttling on all 
FlinkKinesisConsumer threads for multiple minutes.

Walking through the series of events:
 * prolonged CPU throttling occurs and no logs are seen from any 
FlinkKinesisConsumer thread for up to 15 minutes
 * after the CPU throttling ends, the subtask is marked idle
 * the subtask has reached the lookahead for its local watermark relative to 
the global watermark
 * WatermarkSyncCallback sees the subtask as idle and does not update the 
global watermark
 * emitQueue fills to max
 * RecordEmitter cannot emit records due to the max lookahead
 * the subtask is deadlocked

At this point, we did not realize what had happened, and processing of all 
other shards/subtasks continued for multiple days. When we finally 
restarted the application, we saw the following behavior:
 * global watermark recalculated after all subtasks consumed data based on the 
last Kinesis record sequence number
 * global watermark moved back in time multiple days, to when the subtask was 
first marked idle
 * the single subtask processed data while all others remained idle due to the 
lookahead

This would have continued until the subtask had caught up to the others and 
the global watermark was once again within reach of the lookahead for the other 
subtasks.
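
The behavior after the restart can be sketched in the same spirit. The snippet 
below is only a model under stated assumptions: it assumes the global watermark 
is the minimum of the subtasks' local watermarks, and the class name, method 
names, lookahead value and timestamps are all hypothetical.

```java
import java.util.Arrays;

// Simplified model of the post-restart behavior. Assumes the global
// watermark is the minimum local watermark; all names are hypothetical.
public class RestartCatchUpSketch {

    /** Assumed aggregation: the global watermark is the minimum local watermark. */
    static long globalWatermark(long[] localWatermarks) {
        return Arrays.stream(localWatermarks).min().orElse(Long.MIN_VALUE);
    }

    /** A subtask is held back when it is ahead of the global watermark by more than the lookahead. */
    static boolean[] heldBack(long[] localWatermarks, long lookaheadMillis) {
        long global = globalWatermark(localWatermarks);
        boolean[] result = new boolean[localWatermarks.length];
        for (int i = 0; i < localWatermarks.length; i++) {
            result[i] = localWatermarks[i] > global + lookaheadMillis;
        }
        return result;
    }

    public static void main(String[] args) {
        long day = 86_400_000L;
        long lookahead = 600_000L; // assumed lookahead of 10 minutes

        // Subtask 0 was stuck for three days; subtasks 1 and 2 kept processing.
        long[] local = {0L, 3 * day, 3 * day};

        System.out.println("global watermark: " + globalWatermark(local));
        System.out.println("held back: " + Arrays.toString(heldBack(local, lookahead)));
    }
}
```

In this model only subtask 0 is allowed to emit until its local watermark comes 
within the lookahead of the others, which matches the catch-up period described 
above.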

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
