Thanks! Yes, I've looked at both pages. My job shows backpressure starting at an early join step. I'm not sure whether the backfill just needs more time to get through or whether I need more resources.

On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:

> Hi Dan,
>
> I think you could see the details of the checkpoints via the checkpoint
> UI [1]. Also, if you see that some tasks in the pending checkpoints have
> not taken a snapshot, you might check whether this task is backpressuring
> the previous tasks [2].
>
> Best,
> Yun
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>
> ------------------------------------------------------------------
> Sender: Dan Hill <quietgol...@gmail.com>
> Date: 2021/03/02 04:34:56
> Recipient: user <user@flink.apache.org>
> Theme: Debugging long Flink checkpoint durations
>
> Hi. Are there good ways to debug long Flink checkpoint durations?
>
> I'm running a backfill job that processes ~10 days of data, and then
> checkpointing starts failing. Since I only see the last 10 checkpoints in
> the jobmaster UI, I don't see when it starts.
>
> I looked through the text logs and didn't see much.
>
> I assume one of the following:
> 1) I have something misconfigured that is causing old state to stick
> around.
> 2) I don't have enough resources.
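
One way to see when checkpoints start slowing down, rather than only the last few in the UI, is to look at the checkpoint statistics the JobManager serves over REST (`/jobs/<jobid>/checkpoints`). The sketch below is not from this thread. It assumes the response has already been fetched and parsed into a dict, and that each entry under `history` has `id`, `status` and `end_to_end_duration` (in milliseconds), as described in the Flink REST API docs; please verify the field names against your Flink version. The function names and the sample data are made up for illustration. The number of checkpoints kept in that history should be governed by `web.checkpoints.history` (default 10, as far as I know), so raising it would be needed to look further back.

```python
from typing import Any, Dict, List, Optional, Tuple


def completed_durations(stats: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return (checkpoint id, end-to-end duration in ms) for completed
    checkpoints, sorted by checkpoint id (oldest first).

    `stats` is assumed to be the parsed JSON from /jobs/<jobid>/checkpoints.
    """
    rows = [
        (cp["id"], cp["end_to_end_duration"])
        for cp in stats.get("history", [])
        if cp.get("status") == "COMPLETED"
    ]
    return sorted(rows)


def first_slow_checkpoint(stats: Dict[str, Any], threshold_ms: int) -> Optional[int]:
    """Return the id of the oldest completed checkpoint whose duration
    exceeds threshold_ms, or None if there is none."""
    for cp_id, duration in completed_durations(stats):
        if duration > threshold_ms:
            return cp_id
    return None


# Hypothetical sample data shaped like the assumed REST response.
sample_stats = {
    "history": [
        {"id": 3, "status": "COMPLETED", "end_to_end_duration": 900},
        {"id": 1, "status": "COMPLETED", "end_to_end_duration": 200},
        {"id": 2, "status": "FAILED", "end_to_end_duration": 60000},
        {"id": 4, "status": "COMPLETED", "end_to_end_duration": 45000},
    ]
}

if __name__ == "__main__":
    print(completed_durations(sample_stats))
    print(first_slow_checkpoint(sample_stats, threshold_ms=10_000))
```

Failed checkpoints are skipped here on purpose; counting them separately may be just as informative when the job is backpressured.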