Pauli Gandhi created FLINK-28031:
------------------------------------

             Summary: Checkpoint always hangs when running some jobs
                 Key: FLINK-28031
                 URL: https://issues.apache.org/jira/browse/FLINK-28031
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
            Reporter: Pauli Gandhi
We have noticed that the Flink job hangs and the first checkpoint eventually times out after 2 hours, every time stalling after 15/23 acknowledgments (65%) are completed. There is no CPU activity, yet a number of tasks report 100% back pressure. The problem is peculiar to this job and slight modifications of it; we have created many Flink jobs in the past and never encountered this issue.

Here are the things we tried to narrow down the problem:
* The job runs fine if checkpointing is disabled.
* Increasing the number of task managers and the parallelism to 2 seems to help the job complete. However, it stalled again when we sent a larger data set.
* Increasing taskmanager memory from 4 GB to 16 GB and CPU from 1 to 4 did not help.
* Sometimes restarting the job manager helps, but at other times it does not.
* Breaking up the job into smaller parts helps the job finish.
* Analyzing the thread dump shows that all threads are either in a sleeping or a waiting state.

Here are the environment details (a sketch of how these checkpoint settings are typically wired up follows the list):
* Flink version 1.14.3
* Running on Kubernetes
* Using the RocksDB state backend
* Checkpoint storage is S3 using the Presto library
* Exactly-once semantics with unaligned checkpoints enabled
* Checkpoint timeout of 2 hours
* Maximum concurrent checkpoints is 1
* Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
* Using Kafka for input and output

I have attached the task manager logs, thread dump, and screenshots of the job graph and stalled checkpoint.
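For reference, below is a minimal sketch of how the checkpoint settings listed above are typically configured in a Flink 1.14 job. It is an illustration only, not the actual job code (which is not attached); the class name, checkpoint interval, and S3 bucket/path are hypothetical placeholders.

{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Exactly-once checkpointing; the 60 s interval is a placeholder, not taken from the report
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // Checkpoint storage on S3 via the flink-s3-fs-presto plugin (bucket/path are hypothetical)
        cfg.setCheckpointStorage("s3p://my-bucket/flink-checkpoints");
        // Unaligned checkpoints enabled
        cfg.enableUnalignedCheckpoints();
        // Checkpoint timeout of 2 hours
        cfg.setCheckpointTimeout(2L * 60 * 60 * 1000);
        // At most one checkpoint in flight at a time
        cfg.setMaxConcurrentCheckpoints(1);

        // Kafka source/sink and the rest of the job graph would be defined here before env.execute()
    }
}
{code}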