Hi Omkar

First, check the checkpoint web UI [1] to see whether many subtasks fail to 
complete in time or only a few of them. In the former case, your checkpoint 
timeout is probably too short for the current workload. In the latter case, 
some tasks might be stuck on a slow machine, or might be unable to grab the 
checkpoint lock to run the synchronous phase of checkpointing. You can use the 
thread dump feature [2] (requires upgrading to Flink 1.11) or jstack to see 
what is happening in the Java process.
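
When reading a thread dump for this purpose, the lines of interest are threads 
in BLOCKED state, together with the lock they wait for and the thread that 
holds it. As a rough sketch only (not Flink-specific; the class name 
BlockedThreadReport and its method are hypothetical), the same information can 
be read through the standard java.lang.management API from inside a JVM. 
jstack against the TaskManager process id gives the equivalent view from 
outside the process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Flink.
public class BlockedThreadReport {

    /** Returns one line per BLOCKED thread: who waits, for which lock, and who holds it. */
    static List<String> blockedThreads(ThreadInfo[] infos) {
        List<String> lines = new ArrayList<>();
        for (ThreadInfo info : infos) {
            // Entries can be null if a thread died while the dump was taken.
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            lines.add(String.format("\"%s\" waits for %s held by \"%s\"",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName()));
        }
        return lines;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only request monitor/synchronizer details when this JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        List<String> report = blockedThreads(infos);
        if (report.isEmpty()) {
            System.out.println("No BLOCKED threads in this JVM.");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

If the same task thread shows up as BLOCKED across several dumps taken a few 
seconds apart, the thread named as the lock holder is the place to look next.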

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
[2] https://issues.apache.org/jira/browse/FLINK-14816

Best
Yun Tang
________________________________
From: Deshpande, Omkar <omkar_deshpa...@intuit.com>
Sent: Tuesday, September 15, 2020 10:25
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: flink checkpoint timeout

I have followed this 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory
and I am using taskmanager.memory.flink.size now instead of 
taskmanager.heap.size
________________________________
From: Deshpande, Omkar <omkar_deshpa...@intuit.com>
Sent: Monday, September 14, 2020 6:23 PM
To: user@flink.apache.org <user@flink.apache.org>
Subject: flink checkpoint timeout


Hello,

I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the first 
couple of times and then starts failing because of timeouts. The checkpoint 
time grows with every checkpoint and starts exceeding 10 minutes. I do not see 
any exceptions in the logs. I have enabled debug logging at the 
"org.apache.flink" level. How do I investigate this? Garbage collection seems 
fine. There is no backpressure. This used to work as is with Flink 1.9 without 
any issue.

Any pointers on how to investigate the long time taken to complete a checkpoint?

Omkar
