Hi
    You can try to find out whether there is some hot method, or whether the
snapshot stack is waiting for some lock. and maybe
Best,
Congxian
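
As a rough illustration of the two checks suggested above (a thread waiting for a lock, or a hot method), here is a minimal Python sketch that scans the text printed by jstack. It is only a sketch under stated assumptions: it assumes the usual HotSpot jstack layout (a quoted thread name, a "java.lang.Thread.State:" line, then "at ..." frames), and the sample dump, thread names, and com.example classes are made up for the example and are not taken from a real Flink job. A "hot method" is approximated here as the top frame of RUNNABLE threads that repeats across several dumps taken a few seconds apart.

```python
import re
from collections import Counter

# Patterns for the usual HotSpot jstack layout (an assumption of this sketch).
HEADER = re.compile(r'^"(?P<name>.+)" ')
STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
FRAME = re.compile(r'^\s+at (?P<frame>[^\s(]+)\(')
LOCK_WAIT = re.compile(
    r'^\s+- (?:waiting to lock|parking to wait for|waiting on)\s+<(?P<lock>[^>]+)>'
)


def parse_threads(dump):
    """Split jstack text into one dict per thread."""
    threads, current = [], None
    for line in dump.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"name": header.group("name"), "state": None,
                       "frames": [], "waits_for": None}
            threads.append(current)
            continue
        if current is None:
            continue  # preamble before the first thread header
        state = STATE.match(line)
        frame = FRAME.match(line)
        lock = LOCK_WAIT.match(line)
        if state:
            current["state"] = state.group("state")
        elif frame:
            current["frames"].append(frame.group("frame"))
        elif lock and current["waits_for"] is None:
            current["waits_for"] = lock.group("lock")
    return threads


def blocked_threads(threads):
    """Threads that are waiting to enter a monitor held by another thread."""
    return [t for t in threads if t["state"] == "BLOCKED"]


def hot_frames(dumps):
    """Count the top frame of every RUNNABLE thread across several dumps."""
    counts = Counter()
    for dump in dumps:
        for t in parse_threads(dump):
            if t["state"] == "RUNNABLE" and t["frames"]:
                counts[t["frames"][0]] += 1
    return counts


# Made-up sample in jstack style; names and addresses are hypothetical.
SAMPLE_DUMP = '''
"worker-1" #21 prio=5 os_prio=0 tid=0x01 nid=0x11 runnable [0x0a]
   java.lang.Thread.State: RUNNABLE
    at com.example.SlowSink.invoke(SlowSink.java:42)
    at com.example.Pipeline.run(Pipeline.java:10)
    - locked <0x00000000deadbeef> (a java.lang.Object)

"worker-2" #22 prio=5 os_prio=0 tid=0x02 nid=0x12 waiting for monitor entry [0x0b]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.Snapshotter.snapshot(Snapshotter.java:7)
    - waiting to lock <0x00000000deadbeef> (a java.lang.Object)
'''

if __name__ == "__main__":
    for t in blocked_threads(parse_threads(SAMPLE_DUMP)):
        print(t["name"], "is blocked on", t["waits_for"], "at", t["frames"][0])
    print(hot_frames([SAMPLE_DUMP]).most_common(3))
```

To use it on a real job, save a few jstack outputs of the affected TaskManager process to files and pass their contents to hot_frames; a frame that stays on top in every dump, or a task thread that stays BLOCKED on the same lock, is a reasonable place to start looking.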


Deshpande, Omkar <omkar_deshpa...@intuit.com> wrote on Tue, Sep 15, 2020 at 12:30 PM:

> Only a few of the subtasks fail. I cannot upgrade to 1.11 yet, but I can
> still get the thread dump. What should I be looking for in the thread dump?
>
> ------------------------------
> *From:* Yun Tang <myas...@live.com>
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>; user@flink.apache.org
> <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> Hi Omkar
>
> First of all, you should check the checkpoint page of the web UI [1] to see
> whether many subtasks fail to complete in time or just a few of them. The
> former might mean your checkpoint timeout is too short for the current case.
> The latter might mean some task is stuck on a slow machine or cannot grab the
> checkpoint lock to process the sync phase of checkpointing. You can use the
> thread dump [2] (requires upgrading to Flink 1.11) or jstack to see what is
> happening in the Java process.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
> [2] https://issues.apache.org/jira/browse/FLINK-14816
>
> Best
> Yun Tang
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Tuesday, September 15, 2020 10:25
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> I have followed this
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory>
> and I am using taskmanager.memory.flink.size now instead of
> taskmanager.heap.size
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Monday, September 14, 2020 6:23 PM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* flink checkpoint timeout
>
> Hello,
>
> I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the
> first couple of times and then starts failing because of timeouts. The
> checkpoint time grows with every checkpoint and starts exceeding 10
> minutes. I do not see any exceptions in the logs. I have enabled debug
> logging at the "org.apache.flink" level. How do I investigate this? Garbage
> collection seems fine. There is no backpressure. This used to work as is
> with Flink 1.9 without any issue.
>
> Any pointers on how to investigate why checkpoints take so long to complete?
>
> Omkar
>
