Hello all, We recently start to experience Checkpoint timeout randomly. Here are some background information 1. We are on Flink 1.13.1 2. We have been running these type of streaming jobs for years. When checkpoint succeeds, it only take a few seconds. After a week ago, we start to see random checkpoint time outs. When it timeout, feels like it stuck somewhere, couldn’t more forward. After timeout, the job was able to continue from the previous checkpoint and move forward. 3. Our job has quite many parallelisms. 50 ~ 100s. Looking at the checkpoint page. We saw 1 of the subtasks are not acknowledging, which eventually lead to the timeout. 4. The Flink job is running on AWS EKS, the nature of job is relatively simple, read from AWS kinesis and do some transformation and write parquet files to AWS s3.
My goal is to seek some suggestions of where to start trouble shooting. Below is TaskManager log around the time when checkpoint timeout ================================================== 2023-05-25 14:47:30,248 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter [] - Flushing mem columnStore to file. allocated memory: 122742926 2023-05-25 14:47:32,904 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter [] - mem size 134322047 > 134217728: flushing 343857 records to disk. 2023-05-25 14:47:32,979 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter [] - Flushing mem columnStore to file. allocated memory: 122816246 2023-05-25 14:47:34,530 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de compressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0). 2023-05-25 14:47:34,530 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> fl at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,530 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (31/50)#0 (c2905cda734172afa6675014ca1271a0). 2023-05-25 14:47:34,531 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 30 ... 2023-05-25 14:47:34,532 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 30 ... 2023-05-25 14:47:34,531 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 30 ... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.j ar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302] 2023-05-25 14:47:34,535 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 30 ... 2023-05-25 14:47:34,545 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de compressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6). 2023-05-25 14:47:34,545 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> fl at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,545 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (30/50)#0 (0ad93341f291a2aa84be39556b1362e6). 2023-05-25 14:47:34,546 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 29 ... 2023-05-25 14:47:34,550 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de compressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc). 2023-05-25 14:47:34,550 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> fl at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,550 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (29/50)#0 (cb16394cd68fe4775987de4d4e6dc6bc). 2023-05-25 14:47:34,550 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 29 ... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.j ar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302] 2023-05-25 14:47:34,551 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 29 ... 2023-05-25 14:47:34,552 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 29 ... 2023-05-25 14:47:34,552 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 28 ... 2023-05-25 14:47:34,552 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 28 ... 2023-05-25 14:47:34,552 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de compressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed). 2023-05-25 14:47:34,552 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> fl at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,552 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (28/50)#0 (4e1286825df33628e63ea9e4ef6d17ed). 2023-05-25 14:47:34,552 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 28 ... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.j ar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302] 2023-05-25 14:47:34,559 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 27 ... 2023-05-25 14:47:34,559 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map de compressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306). 2023-05-25 14:47:34,559 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> fl at events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,559 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (34/50)#0 (211379391d2c2a39dd2d11b50439a306). 2023-05-25 14:47:34,559 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 27 ... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302] 2023-05-25 14:47:34,559 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 33 ... 2023-05-25 14:47:34,560 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 28 ... 2023-05-25 14:47:34,561 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 27 ... 2023-05-25 14:47:34,561 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed t o events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1). 2023-05-25 14:47:34,561 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map decompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1) switched from RUNNING to CANCELING. 2023-05-25 14:47:34,561 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code Source: psc-uds-p9-eedr-random-sample-cdca-prod -> map batch to events -> (map dec ompressed to events -> flat events -> (Sink: Parquet Sink, Filter -> Sink: Corrupted Event Sink), Process -> Map -> Sink: Parquet PB Raw Sink) (33/50)#0 (baf2652040493f67373c3877b825a1d1). 2023-05-25 14:47:34,561 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 33 ... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_302] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.adjustRunLoopFrequency(PollingRecordPublisher.java:216) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:124) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.publisher.polling.PollingRecordPublisher.run(PollingRecordPublisher.java:102) ~[flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:114) [flink-connector-kinesis_2.11-1.13.1.jar:1.13.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_302] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302] 2023-05-25 14:47:34,562 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 32 ... 2023-05-25 14:47:34,564 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 32 ... 2023-05-25 14:47:34,564 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Shutting down the shard consumer threads of subtask 33 ... 2023-05-25 14:47:34,564 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher [] - Starting shutdown of shard consumer threads and AWS SDK resources of subtask 32 ... =================================== Job manager log only has this =================================== org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98) at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1934) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1906) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:96) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1990) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ============================== Thanks in advance Ivan