Pauli Gandhi created FLINK-28031:
------------------------------------

             Summary: Checkpoint always hangs when running some jobs
                 Key: FLINK-28031
                 URL: https://issues.apache.org/jira/browse/FLINK-28031
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
            Reporter: Pauli Gandhi


We have noticed that this Flink job hangs at the first checkpoint every time and 
eventually times out after 2 hours, once the checkpoint reaches 15/23 
acknowledgments (65%).  There is no CPU activity, yet a number of tasks report 
100% back pressure.  The problem is specific to this job and slight variations 
of it; we have created many Flink jobs in the past and never encountered this 
issue.

Here are the things we tried to narrow down the problem (a short sketch of the 
settings we toggled follows this list):
 * The job runs fine if checkpointing is disabled.
 * Increasing the number of task managers and the parallelism to 2 seemed to 
help the job complete.  However, it stalled again when we sent a larger data set.
 * Increasing the TaskManager memory from 4 GB to 16 GB and the CPU from 1 to 4 
did not help.
 * Restarting the JobManager sometimes helps, but not always.
 * Breaking the job up into smaller parts allows it to finish.
 * Analyzing the thread dump shows that all threads are either sleeping or in a 
wait state.
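
For reference, a minimal sketch of the two settings toggled in the first tests 
above, assuming the job is built with the DataStream API (the checkpoint 
interval shown is illustrative, not our actual value):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TroubleshootingToggles {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Test 1: leave checkpointing disabled (the default when
        // enableCheckpointing() is never called) -- the job then runs to completion.
        // env.enableCheckpointing(60_000L);  // 60 s interval is a placeholder

        // Test 2: raise the parallelism from 1 to 2 -- helps until a larger
        // data set is sent.
        env.setParallelism(2);

        // ... Kafka source -> transformations -> Kafka sink (omitted) ...
    }
}
{code}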

Here are the environment details (a configuration sketch follows this list):
 * Flink version 1.14.3
 * Running on Kubernetes
 * Using the RocksDB state backend.
 * Checkpoint storage is S3, accessed through the Presto library
 * Exactly-once semantics with unaligned checkpoints enabled.
 * Checkpoint timeout 2 hours
 * Maximum concurrent checkpoints is 1
 * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
 * Using Kafka for input and output
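
To make the environment concrete, here is a minimal sketch of how the 
checkpointing side of the job is configured with the DataStream API.  The 
checkpoint interval and the s3p:// bucket path are placeholders (the exact S3 
scheme depends on which S3 plugins are loaded); the other values mirror the 
settings listed above.

{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend (Flink 1.14 API).
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Exactly-once checkpoints; the 60 s interval is a placeholder.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.enableUnalignedCheckpoints();
        checkpointConfig.setCheckpointTimeout(2 * 60 * 60 * 1000L);  // 2 hours
        checkpointConfig.setMaxConcurrentCheckpoints(1);
        // Checkpoints go to S3 via the flink-s3-fs-presto plugin; the bucket
        // path below is a placeholder.
        checkpointConfig.setCheckpointStorage("s3p://my-bucket/checkpoints");

        // ... Kafka source -> transformations -> Kafka sink (omitted) ...
    }
}
{code}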

I have attached the TaskManager logs, a thread dump, and screenshots of the job 
graph and the stalled checkpoint.



