I'm not 100% sure, but from the given information this might be related to
FLINK-14498 [1], and partially relieved by FLINK-16645 [2].

@Omkar Could you try out the 1.11.0 release and see whether the issue
disappears?

@zhijiang <wangzhijiang...@aliyun.com> @yingjie could you also take a look
here? Thanks.

Best Regards,
Yu

[1] https://issues.apache.org/jira/browse/FLINK-14498
[2] https://issues.apache.org/jira/browse/FLINK-16645
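
For the jstack route suggested in the quoted thread below, here is a minimal
sketch (not an official Flink tool) that scans a saved jstack dump and lists
the threads whose stack contains a given frame, for example the
LocalBufferPool.requestMemorySegmentBlocking frame from Omkar's trace. The
function name, the sample dump, and the thread names in it are hypothetical
and only illustrate the usual jstack layout of one blank-line-separated entry
per thread.

```python
import re

# Hypothetical, hand-written sample in the usual jstack layout.
# Thread names and ids are made up for illustration.
SAMPLE_DUMP = '''"map-task (1/4)" #71 prio=5 tid=0x01 nid=0x11 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)

"timer-thread" #72 prio=5 tid=0x02 nid=0x12 runnable
   java.lang.Thread.State: RUNNABLE
    at java.lang.Thread.sleep(Native Method)
'''


def threads_containing(dump_text, frame_substring):
    """Return the names of threads whose dump entry mentions frame_substring."""
    names = []
    # jstack separates thread entries with blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        # The first line of an entry starts with the quoted thread name.
        header = re.match(r'\s*"([^"]+)"', block)
        if header and frame_substring in block:
            names.append(header.group(1))
    return names


if __name__ == "__main__":
    frame = "LocalBufferPool.requestMemorySegmentBlocking"
    for name in threads_containing(SAMPLE_DUMP, frame):
        print(name)
```

Running it against dumps taken a few minutes apart should show whether the
same task threads stay parked on that frame, which would suggest they are
waiting for network buffers rather than doing work.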


On Fri, 18 Sep 2020 at 09:28, Deshpande, Omkar <omkar_deshpa...@intuit.com>
wrote:

> These are the hotspot methods. Any pointers on debugging this? The
> checkpoints have kept timing out since migrating from 1.9 to 1.10.
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Wednesday, September 16, 2020 5:27 PM
> *To:* Congxian Qiu <qcx978132...@gmail.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>; Yun Tang <
> myas...@live.com>
> *Subject:* Re: flink checkpoint timeout
>
> This email is from an external sender.
>
> This thread seems to be stuck in the awaiting-notification state:
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
>
> ------------------------------
> *From:* Congxian Qiu <qcx978132...@gmail.com>
> *Sent:* Monday, September 14, 2020 10:57 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> Hi
>     You can try to find out whether there is some hot method, or whether the
> snapshot stack is waiting for some lock. and maybe
> Best,
> Congxian
>
>
> On Tue, 15 Sep 2020 at 12:30 PM, Deshpande, Omkar <omkar_deshpa...@intuit.com> wrote:
>
> Only a few of the subtasks fail. I cannot upgrade to 1.11 yet, but I can
> still get the thread dump. What should I be looking for in the thread dump?
>
> ------------------------------
> *From:* Yun Tang <myas...@live.com>
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar <omkar_deshpa...@intuit.com>; user@flink.apache.org
> <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> Hi Omkar
>
> First of all, you should check the checkpoint page of the web UI [1] to see
> whether many subtasks fail to complete in time or just a few of them. In the
> former case, your checkpoint timeout might not be long enough for the
> current workload. In the latter case, some task might be stuck on a slow
> machine, or might be unable to grab the checkpoint lock to process the sync
> phase of checkpointing. You can use the thread dump [2] (requires upgrading
> to Flink 1.11) or jstack to see what is happening in the Java process.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/checkpoint_monitoring.html
> [2] https://issues.apache.org/jira/browse/FLINK-14816
>
> Best
> Yun Tang
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Tuesday, September 15, 2020 10:25
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: flink checkpoint timeout
>
> I have followed this
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html#container-cut-off-memory>
> and I am using taskmanager.memory.flink.size now instead of
> taskmanager.heap.size
> ------------------------------
> *From:* Deshpande, Omkar <omkar_deshpa...@intuit.com>
> *Sent:* Monday, September 14, 2020 6:23 PM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* flink checkpoint timeout
>
> Hello,
>
> I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the
> first couple of times and then starts failing because of timeouts. The
> checkpoint time grows with every checkpoint and starts exceeding 10
> minutes. I do not see any exceptions in the logs. I have enabled debug
> logging at the "org.apache.flink" level. How do I investigate this? Garbage
> collection seems fine. There is no backpressure. This used to work as-is
> with Flink 1.9 without any issue.
>
> Any pointers on how to investigate the long time taken to complete a checkpoint?
>
> Omkar
>
>
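
To put numbers on the symptom from the original question (checkpoint time
growing with every checkpoint), here is a rough sketch that inspects
checkpoint statistics of the kind returned by Flink's REST endpoint
/jobs/<jobid>/checkpoints. It assumes the response carries a "history" list
whose entries have "id", "status" and "end_to_end_duration" (milliseconds);
those field names are from memory and should be checked against your Flink
version. The function names and the sample values are hypothetical, not real
measurements.

```python
# Hypothetical sample shaped like a checkpoint statistics response.
# All values are made up for illustration.
SAMPLE_STATS = {
    "history": [
        {"id": 4, "status": "FAILED", "end_to_end_duration": 600000},
        {"id": 3, "status": "COMPLETED", "end_to_end_duration": 410000},
        {"id": 2, "status": "COMPLETED", "end_to_end_duration": 95000},
        {"id": 1, "status": "COMPLETED", "end_to_end_duration": 12000},
    ]
}


def completed_durations(stats):
    """Return (checkpoint id, duration in ms) pairs for completed checkpoints, oldest first."""
    history = stats.get("history", [])
    completed = [c for c in history if c.get("status") == "COMPLETED"]
    completed.sort(key=lambda c: c["id"])
    return [(c["id"], c["end_to_end_duration"]) for c in completed]


def is_steadily_growing(durations, min_points=3):
    """True if there are at least min_points durations and each exceeds the previous one."""
    values = [duration for _, duration in durations]
    if len(values) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    pairs = completed_durations(SAMPLE_STATS)
    for checkpoint_id, duration_ms in pairs:
        print(checkpoint_id, duration_ms)
    print("steadily growing:", is_steadily_growing(pairs))
```

A strictly increasing series like this points at something accumulating
between checkpoints rather than a timeout that is simply set too low, which
is the distinction Yun Tang's reply draws.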
