Hi,

In the thread dump, you can try to find out whether there is some hot method, or whether the snapshot stack is waiting for some lock. and maybe

Best,
Congxian

Deshpande, Omkar <omkar_deshpa...@intuit.com> wrote on Tue, Sep 15, 2020 at 12:30 PM:

> A few of the subtasks fail. I cannot upgrade to 1.11 yet, but I can still
> get the thread dump. What should I be looking for in the thread dump?
>
> ------------------------------
> *From:* Yun Tang <myas...@live.com>
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>; user@flink.apache.org
> <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> Hi Omkar
>
> First of all, you should check the checkpoint page of the web UI [1] to see
> whether many subtasks fail to complete in time or just a few of them. In the
> former case, your checkpoint timeout might be too short for the current
> workload. In the latter case, some task might be stuck on a slow machine or
> unable to grab the checkpoint lock to process the sync phase of
> checkpointing; you can use the thread dump [2] (needs a bump to Flink 1.11)
> or jstack to see what is happening in the Java process.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
> [2] https://issues.apache.org/jira/browse/FLINK-14816
>
> Best
> Yun Tang
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Tuesday, September 15, 2020 10:25
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> I have followed this
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory
> and I am using taskmanager.memory.flink.size now instead of
> taskmanager.heap.size
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Monday, September 14, 2020 6:23 PM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* flink checkpoint timeout
>
> Hello,
>
> I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the
> first couple of times and then starts failing because of timeouts. The
> checkpoint time grows with every checkpoint and starts exceeding 10
> minutes. I do not see any exceptions in the logs. I have enabled debug
> logging at the "org.apache.flink" level. How do I investigate this? Garbage
> collection seems fine. There is no backpressure. This used to work as is
> with Flink 1.9 without any issue.
>
> Any pointers on how to investigate the long time taken to complete a
> checkpoint?
>
> Omkar
>
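
To make the thread-dump advice in this thread a bit more concrete, here is a minimal sketch (not anything from Flink itself) of how one might scan the text produced by jstack for the two things mentioned: a hot method and threads waiting for a lock. The function name summarize_dump is hypothetical, and the sketch assumes the usual HotSpot dump layout (a quoted thread name, a "java.lang.Thread.State:" line, "at ..." frames, and "- waiting to lock <0x...>" lines).

```python
import re
from collections import Counter

# Patterns for the usual HotSpot jstack layout (assumed, not Flink-specific).
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE_LINE = re.compile(r'^\s*java\.lang\.Thread\.State: (?P<state>\w+)')
FRAME_LINE = re.compile(r'^\s*at (?P<frame>[^\s(]+)')
LOCK_WAIT = re.compile(r'^\s*- waiting to lock <(?P<lock>0x[0-9a-f]+)>')


def summarize_dump(text):
    """Return (hot_frames, lock_waiters) for one thread dump.

    hot_frames counts the top stack frame of every RUNNABLE thread;
    lock_waiters lists (thread name, lock address) for every thread
    that is waiting to lock a monitor.
    """
    hot_frames = Counter()
    lock_waiters = []
    name = state = None
    seen_frame = False
    for line in text.splitlines():
        m = THREAD_HEADER.match(line)
        if m:
            # A new thread section starts: reset the per-thread state.
            name, state, seen_frame = m.group("name"), None, False
            continue
        m = STATE_LINE.match(line)
        if m:
            state = m.group("state")
            continue
        m = FRAME_LINE.match(line)
        if m:
            # Only the first (topmost) frame of a thread is counted.
            if not seen_frame and state == "RUNNABLE":
                hot_frames[m.group("frame")] += 1
            seen_frame = True
            continue
        m = LOCK_WAIT.match(line)
        if m and name is not None:
            lock_waiters.append((name, m.group("lock")))
    return hot_frames, lock_waiters
```

One way to use it: save the output of jstack for the TaskManager process several times while a checkpoint is running, call summarize_dump on each file's text, and compare. A frame that keeps appearing in hot_frames points to a hot method; a task or checkpoint thread that keeps appearing in lock_waiters points to contention on a lock.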