Hi Omkar,

First of all, check the checkpoint page of the web UI [1] to see whether many subtasks fail to complete in time or just a few of them. If it is many, your checkpoint timeout is probably too short for the current workload. If it is only a few, some tasks might be stuck on a slow machine, or might be unable to grab the checkpoint lock to run the synchronous phase of checkpointing. In that case you can use a thread dump [2] (this requires upgrading to Flink 1.11) or jstack to see what is happening in the Java process.
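
The many-versus-few distinction above can be sketched as a small triage helper. This is only an illustration, not part of Flink: the function name `triage_checkpoint`, the record shape (an "index" and a "status" per subtask, with "completed" meaning the subtask acknowledged the checkpoint) and the 50% cut-off are all assumptions made for the example. In practice you would fill the list from what the checkpoint page of the web UI shows for the timed-out checkpoint.

```python
from typing import Dict, List, Tuple


def triage_checkpoint(
    subtasks: List[Dict], widespread_ratio: float = 0.5
) -> Tuple[str, List[int]]:
    """Classify a timed-out checkpoint by how many subtasks never acknowledged it.

    `subtasks` is a hypothetical list of per-subtask records, each with an
    "index" and a "status" key. Any status other than "completed" is treated
    as "did not finish in time". `widespread_ratio` is an arbitrary cut-off
    chosen for this sketch, not a Flink setting.
    """
    if not subtasks:
        raise ValueError("no subtask records supplied")

    # Collect the indices of subtasks that did not acknowledge the checkpoint.
    pending = [s["index"] for s in subtasks if s["status"] != "completed"]

    if not pending:
        return ("ok", [])
    if len(pending) / len(subtasks) >= widespread_ratio:
        # Many subtasks are late: the timeout itself is the first suspect.
        return ("widespread", pending)
    # Only a few are late: inspect those tasks (slow machine, lock contention).
    return ("isolated", pending)


if __name__ == "__main__":
    # Made-up sample: 8 subtasks, only subtask 3 never acknowledged.
    sample = [{"index": i, "status": "completed"} for i in range(8)]
    sample[3]["status"] = "pending_or_failed"
    print(triage_checkpoint(sample))
```

A "widespread" result points towards raising the checkpoint timeout, while an "isolated" result points towards taking a thread dump or running jstack on the TaskManagers that host the listed subtasks.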

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
[2] https://issues.apache.org/jira/browse/FLINK-14816

Best
Yun Tang

________________________________
From: Deshpande, Omkar <omkar_deshpa...@intuit.com>
Sent: Tuesday, September 15, 2020 10:25
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: flink checkpoint timeout

I have followed https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory and I am now using taskmanager.memory.flink.size instead of taskmanager.heap.size.

________________________________
From: Deshpande, Omkar <omkar_deshpa...@intuit.com>
Sent: Monday, September 14, 2020 6:23 PM
To: user@flink.apache.org <user@flink.apache.org>
Subject: flink checkpoint timeout

Hello,

I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the first couple of times and then starts failing because of timeouts. The checkpoint time grows with every checkpoint and starts exceeding 10 minutes. I do not see any exceptions in the logs. I have enabled debug logging at the "org.apache.flink" level. How do I investigate this?

Garbage collection seems fine. There is no backpressure. This used to work as is with Flink 1.9 without any issue. Any pointers on how to investigate the long time taken to complete a checkpoint?

Omkar