Re: Debugging long Flink checkpoint durations

Dan Hill Tue, 02 Mar 2021 15:46:00 -0800

Thanks! Yes, I've looked at these. My job is facing backpressure
starting at an early join step. I'm unsure whether the backfill just needs
more time or whether I need more resources.


On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:

> Hi Dan,
>
> I think you can see the details of the checkpoints in the checkpoint
> UI [1]. Also, if the pending checkpoints show that some tasks have not
> taken a snapshot, you might check whether such a task is backpressuring
> the tasks before it [2].
>
> Best,
> Yun
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>
> ------------------------------------------------------------------
> Sender:Dan Hill<quietgol...@gmail.com>
> Date:2021/03/02 04:34:56
> Recipient:user<user@flink.apache.org>
> Subject:Debugging long Flink checkpoint durations
>
> Hi.  Are there good ways to debug long Flink checkpoint durations?
>
> I'm running a backfill job that processes ~10 days of data, and then
> checkpoints start failing.  Since I only see the last 10 checkpoints in the
> jobmaster UI, I can't see when the failures start.
>
> I looked through the text logs and didn't see much.
>
> I assume:
> 1) I have something misconfigured that is causing old state to stick
> around.
> 2) I don't have enough resources.
>
>
>
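
On the "only the last 10 checkpoints" point: the size of that history is set by the `web.checkpoints.history` option (default 10), so raising it should keep more entries around. The same history is also exposed by the REST endpoint `/jobs/<jobid>/checkpoints`, which makes it possible to log it over time and see when durations begin to climb. Below is a minimal Python sketch of that idea. The field names (`history`, `id`, `status`, `end_to_end_duration`, `state_size`) are what I recall from the REST API docs for the 1.12 line and should be checked against your version; the base URL and the helper names are purely illustrative.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(base_url, job_id):
    """Fetch checkpoint statistics for one job from the Flink REST API.

    base_url is assumed to look like "http://localhost:8081" (hypothetical).
    """
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def summarize_history(stats):
    """Return (id, status, duration_ms, state_size_bytes) tuples, oldest first.

    Field names are assumed from the REST docs; missing fields become None.
    """
    rows = [
        (c["id"], c["status"], c.get("end_to_end_duration"), c.get("state_size"))
        for c in stats.get("history", [])
    ]
    return sorted(rows, key=lambda row: row[0])


def first_failure(rows):
    """Return the id of the earliest FAILED checkpoint in rows, or None."""
    for checkpoint_id, status, _duration, _size in rows:
        if status == "FAILED":
            return checkpoint_id
    return None


# Example use against a running cluster (not executed here):
#   stats = fetch_checkpoint_stats("http://localhost:8081", "<job-id>")
#   for row in summarize_history(stats):
#       print(row)
```

Running this periodically and appending the rows to a file would give a record longer than what the UI retains, so the point where duration or state size starts growing becomes visible.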
