Hi Dan,

I think you can see the details of the checkpoints in the checkpoint UI [1]. Also, if the pending checkpoints show that some tasks have not taken their snapshot, you might check whether those tasks are backpressuring the upstream tasks [2].
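Since the UI only shows the most recent checkpoints, here is a rough sketch of how the same statistics could be pulled from the REST API and ranked by duration. It assumes the `/jobs/<jobid>/checkpoints` endpoint returns a `history` list whose entries carry `id`, `status`, `end_to_end_duration` and `state_size`; those field names, the helper names and the URL below are assumptions to verify against your Flink version. As far as I know, the number of checkpoints kept in that history is set by `web.checkpoints.history` (default 10) in flink-conf.yaml, so raising it before a backfill should let you see when the durations start to grow.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(rest_url, job_id):
    """Fetch checkpoint statistics for one job from the Flink REST API.

    rest_url is the JobManager REST address, e.g. "http://localhost:8081"
    (hypothetical; use your own cluster address).
    """
    with urlopen(f"{rest_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def slowest_checkpoints(stats, top_n=5):
    """Return (id, status, duration_ms, state_size) for the slowest checkpoints.

    Assumes stats["history"] is a list of per-checkpoint dicts; missing
    fields are tolerated so the function does not fail on partial entries.
    """
    history = stats.get("history", [])
    ranked = sorted(
        history,
        key=lambda c: c.get("end_to_end_duration", 0),
        reverse=True,
    )
    return [
        (
            c.get("id"),
            c.get("status"),
            c.get("end_to_end_duration"),
            c.get("state_size"),
        )
        for c in ranked[:top_n]
    ]
```

Calling `slowest_checkpoints(fetch_checkpoint_stats("http://localhost:8081", "<jobid>"))` periodically during the backfill would show whether duration and state size grow together, which helps tell lingering old state apart from a plain lack of resources.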
Best,
Yun

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html

------------------------------------------------------------------
Sender: Dan Hill <quietgol...@gmail.com>
Date: 2021/03/02 04:34:56
Recipient: user <user@flink.apache.org>
Subject: Debugging long Flink checkpoint durations

Hi. Are there good ways to debug long Flink checkpoint durations?

I'm running a backfill job that runs ~10 days of data and then checkpointing starts failing. Since I only see the last 10 checkpoints in the jobmaster UI, I don't see when it starts. I looked through the text logs and didn't see much.

I assume:
1) I have something misconfigured that is causing old state to stick around.
2) I don't have enough resources.