Re: Debugging long Flink checkpoint durations

Dan Hill Thu, 04 Mar 2021 12:38:58 -0800

I dove deeper into it and made a little more progress (by giving more
resources).


Here is a screenshot of one bottleneck:
https://drive.google.com/file/d/1CIatEuIJwmKjBE9__RihVlxSilchtKS1/view

My job isn't making any progress.  It keeps checkpointing and failing.  The
TaskManager text logs are empty during the checkpoint.  It's not clear if
the checkpoint is making any progress.
https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing
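
One way to check whether an in-flight checkpoint is advancing, when the text
logs are silent, is to poll the checkpoint statistics over the JobManager REST
API and compare acknowledged subtasks between polls. The following is a
minimal sketch, not a definitive tool. It assumes the /jobs/<jobid>/checkpoints
endpoint and the field names "history", "status", "num_subtasks",
"num_acknowledged_subtasks" and "end_to_end_duration"; verify them against the
REST documentation for your Flink version. The base URL and job id passed to
fetch_checkpoint_stats are placeholders, and the sample data is made up.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(base_url, job_id):
    """Fetch checkpoint statistics from the JobManager REST API.

    base_url and job_id are placeholders, e.g. "http://localhost:8081".
    The endpoint path is an assumption to check for your Flink version.
    """
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def summarize_checkpoints(stats):
    """Return one summary row per checkpoint found in stats["history"]."""
    rows = []
    for cp in stats.get("history", []):
        total = cp.get("num_subtasks", 0)
        acked = cp.get("num_acknowledged_subtasks", 0)
        rows.append({
            "id": cp.get("id"),
            "status": cp.get("status"),
            "acked": acked,
            "total": total,
            # Subtasks that have not yet acknowledged their snapshot.
            "pending": total - acked,
            "duration_ms": cp.get("end_to_end_duration"),
        })
    return rows


# Made-up sample data in the assumed response shape.
sample = {
    "history": [
        {"id": 12, "status": "IN_PROGRESS", "num_subtasks": 40,
         "num_acknowledged_subtasks": 31, "end_to_end_duration": 540000},
        {"id": 11, "status": "FAILED", "num_subtasks": 40,
         "num_acknowledged_subtasks": 35, "end_to_end_duration": 600000},
    ]
}

for row in summarize_checkpoints(sample):
    print(row)
```

If the "acked" count of an IN_PROGRESS checkpoint grows between two polls, the
checkpoint is still advancing; if it stays flat, the pending subtasks are the
ones to inspect for backpressure.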

I spent some time changing the memory parameters but it's unclear if I'm
making forward progress.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html
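
To make the effect of a memory parameter change easier to reason about, it can
help to work out how a configured total process size is split into components.
The sketch below follows the TaskManager memory model described on the page
linked above. The default fractions and sizes used as function arguments are
assumptions taken from memory of the 1.11 defaults and should be checked
against your own configuration; the function name and the 4096 MB example
value are purely illustrative.

```python
def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(high, value))


def taskmanager_memory_breakdown(
    process_mb,
    managed_fraction=0.4,        # assumed default for managed memory
    network_fraction=0.1,        # assumed default for network memory
    network_min_mb=64,
    network_max_mb=1024,
    overhead_fraction=0.1,       # assumed default for JVM overhead
    overhead_min_mb=192,
    overhead_max_mb=1024,
    metaspace_mb=256,            # assumed default JVM metaspace
    framework_heap_mb=128,
    framework_offheap_mb=128,
    task_offheap_mb=0,
):
    """Split a total process size (in MB) into memory components."""
    # JVM overhead is a clamped fraction of the total process size.
    overhead = clamp(process_mb * overhead_fraction,
                     overhead_min_mb, overhead_max_mb)
    # Total Flink memory is what remains after overhead and metaspace.
    flink_total = process_mb - overhead - metaspace_mb
    managed = flink_total * managed_fraction
    network = clamp(flink_total * network_fraction,
                    network_min_mb, network_max_mb)
    # Task heap receives whatever is left inside total Flink memory.
    task_heap = (flink_total - framework_heap_mb - framework_offheap_mb
                 - task_offheap_mb - managed - network)
    if task_heap < 0:
        raise ValueError("process size too small for the other components")
    return {
        "jvm_overhead": overhead,
        "jvm_metaspace": metaspace_mb,
        "framework_heap": framework_heap_mb,
        "framework_offheap": framework_offheap_mb,
        "task_offheap": task_offheap_mb,
        "managed": managed,
        "network": network,
        "task_heap": task_heap,
    }


for name, size in taskmanager_memory_breakdown(4096).items():
    print(f"{name}: {size:.1f} MB")
```

Under these assumptions, lowering managed_fraction shifts memory from the
managed pool to task heap, which is the kind of trade-off to compare against
the memory figures the TaskManager reports at startup.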




On Tue, Mar 2, 2021 at 3:45 PM Dan Hill <quietgol...@gmail.com> wrote:

> Thanks!  Yes, I've looked at these.  My job is facing backpressure
> starting at an early join step.  I'm not sure whether the backfill just
> needs more time or whether I need more resources.
>
> On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:
>
>> Hi Dan,
>>
>> I think you can see the details of the checkpoints in the checkpoint
>> monitoring UI [1]. Also, if some tasks in a pending checkpoint have not
>> taken their snapshot, you might check whether those tasks are
>> backpressuring the upstream tasks [2].
>>
>> Best,
>> Yun
>>
>>
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>>
>> ------------------------------------------------------------------
>> Sender:Dan Hill<quietgol...@gmail.com>
>> Date:2021/03/02 04:34:56
>> Recipient:user<user@flink.apache.org>
>> Subject:Debugging long Flink checkpoint durations
>>
>> Hi.  Are there good ways to debug long Flink checkpoint durations?
>>
>> I'm running a backfill job that processes ~10 days of data and then its
>> checkpoints start failing.  Since I only see the last 10 checkpoints in the
>> jobmaster UI, I can't see when the failures start.
>>
>> I looked through the text logs and didn't see much.
>>
>> I assume:
>> 1) I have something misconfigured that is causing old state to stick
>> around.
>> 2) I don't have enough resources.
>>
>>
>>
