I'm not 100% sure but from the given information this might be related to FLINK-14498 [1] and partially relieved by FLINK-16645 [2].
@Omkar Could you try the 1.11.0 release and see whether the issue disappears?

@zhijiang <wangzhijiang...@aliyun.com> @yingjie could you also take a look here? Thanks.

Best Regards,
Yu

[1] https://issues.apache.org/jira/browse/FLINK-14498
[2] https://issues.apache.org/jira/browse/FLINK-16645

On Fri, 18 Sep 2020 at 09:28, Deshpande, Omkar <omkar_deshpa...@intuit.com> wrote:

> These are the hotspot methods. Any pointers on debugging this? The
> checkpoints keep timing out since migrating to 1.10 from 1.9.
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Wednesday, September 16, 2020 5:27 PM
> *To:* Congxian Qiu <qcx978132...@gmail.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>; Yun Tang <myas...@live.com>
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> This thread seems to be stuck in the awaiting-notification state:
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
>
> ------------------------------
> *From:* Congxian Qiu <qcx978132...@gmail.com>
> *Sent:* Monday, September 14, 2020 10:57 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> Hi
> You can try to find out whether there is some hot method, or whether the
> snapshot stack is waiting for some lock.
> and maybe
> Best,
> Congxian
>
> Deshpande, Omkar <omkar_deshpa...@intuit.com> wrote on Tue, Sep 15, 2020 at 12:30 PM:
>
> Only a few of the subtasks fail. I cannot upgrade to 1.11 yet, but I can
> still get the thread dump. What should I be looking for in the thread dump?
>
> ------------------------------
> *From:* Yun Tang <myas...@live.com>
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>; user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> Hi Omkar
>
> First of all, check the checkpoint page of the web UI [1] to see whether
> many subtasks fail to complete in time or just a few of them. In the former
> case, your checkpoint timeout is probably too short for the current
> workload. In the latter case, some task may be stuck on a slow machine, or
> may be unable to grab the checkpoint lock to run the sync phase of
> checkpointing. You can use the thread dump feature [2] (requires upgrading
> to Flink 1.11) or jstack to see what is happening in the Java process.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
> [2] https://issues.apache.org/jira/browse/FLINK-14816
>
> Best
> Yun Tang
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Tuesday, September 15, 2020 10:25
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> I have followed this
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory>
> and I am now using taskmanager.memory.flink.size instead of
> taskmanager.heap.size.
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Monday, September 14, 2020 6:23 PM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* flink checkpoint timeout
>
> Hello,
>
> I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the
> first couple of times and then starts failing because of timeouts. The
> checkpoint time grows with every checkpoint and starts exceeding 10
> minutes. I do not see any exceptions in the logs. I have enabled debug
> logging at the "org.apache.flink" level. How do I investigate this? Garbage
> collection seems fine. There is no backpressure. This used to work as is
> with Flink 1.9 without any issue.
>
> Any pointers on how to investigate the long time taken to complete a
> checkpoint?
>
> Omkar
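
For anyone who cannot move to 1.11 yet and has to work from jstack output, here is a rough sketch of how one might scan a saved dump for threads parked in the frame quoted above (LocalBufferPool.requestMemorySegmentBlocking). It is only an illustration, not a Flink tool. The helper names, the sample dump, and the thread names in it are made up. It also assumes the usual jstack layout, where each thread section starts with the thread name in double quotes and each frame starts with "at ".

```python
import re
from typing import List, Tuple

# Frame taken from the stack trace quoted in this thread.
BUFFER_WAIT_FRAME = "LocalBufferPool.requestMemorySegmentBlocking"

# Assumption: jstack starts each thread section with the name in double quotes.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def parse_thread_dump(dump_text: str) -> List[Tuple[str, List[str]]]:
    """Return (thread name, stack frames) pairs in the order they appear."""
    threads: List[Tuple[str, List[str]]] = []
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread section begins; start an empty frame list for it.
            threads.append((header.group("name"), []))
        elif threads and line.startswith("at "):
            # Frame line: drop the leading "at " and attach it to the current thread.
            threads[-1][1].append(line[3:])
    return threads


def threads_waiting_on(dump_text: str, frame: str = BUFFER_WAIT_FRAME) -> List[str]:
    """Return the names of threads whose stack contains the given frame."""
    return [
        name
        for name, frames in parse_thread_dump(dump_text)
        if any(frame in f for f in frames)
    ]


# Hypothetical, shortened dump used only to exercise the helpers.
SAMPLE_DUMP = '''
"Map -> Sink (1/4)" #71 prio=5 os_prio=0 tid=0x01 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)

"main" #1 prio=5 os_prio=0 tid=0x02 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
    at java.lang.Thread.sleep(Native Method)
'''

if __name__ == "__main__":
    for thread_name in threads_waiting_on(SAMPLE_DUMP):
        print(thread_name)
```

To use it on a real dump, read the file that jstack wrote and pass its text to threads_waiting_on. If the same subtask threads show up in that frame across several dumps taken a few seconds apart, that would be consistent with tasks waiting for network buffers, although the sketch itself cannot confirm the cause.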