I dug deeper into it and made a little more progress by giving the job more resources.

Here is a screenshot of one bottleneck:
https://drive.google.com/file/d/1CIatEuIJwmKjBE9__RihVlxSilchtKS1/view

My job isn't making any progress. It's checkpointing and failing. The taskmaster text logs are empty during the checkpoint. It's not clear if the checkpoint is making any progress.
https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing

I spent some time changing the memory parameters but it's unclear if I'm making forward progress.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html

On Tue, Mar 2, 2021 at 3:45 PM Dan Hill <quietgol...@gmail.com> wrote:

> Thanks! Yes, I've looked at these. My job is facing backpressure
> starting at an early join step. I'm unclear if more time is fine for the
> backfill or if I need more resources.
>
> On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:
>
>> Hi Dan,
>>
>> I think you could see the detail of the checkpoints via the checkpoint
>> UI[1]. Also, if you see in the
>> pending checkpoints some tasks do not take snapshot, you might have a
>> look whether this task
>> is backpressuring the previous tasks [2].
>>
>> Best,
>> Yun
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>>
>> ------------------------------------------------------------------
>> Sender:Dan Hill<quietgol...@gmail.com>
>> Date:2021/03/02 04:34:56
>> Recipient:user<user@flink.apache.org>
>> Theme:Debugging long Flink checkpoint durations
>>
>> Hi. Are there good ways to debug long Flink checkpoint durations?
>>
>> I'm running a backfill job that runs ~10 days of data and then starts
>> checkpointing failing. Since I only see the last 10 checkpoints in the
>> jobmaster UI, I don't see when it starts.
>>
>> I looked through the text logs and didn't see much.
>>
>> I assume:
>> 1) I have something misconfigured that is causing old state is sticking
>> around.
>> 2) I don't have enough resources.
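
Since the UI only keeps the last 10 checkpoints, one way to see when the failures start might be to poll the JobManager REST API and append the checkpoint history to a file. Below is a minimal sketch, not a tested tool. It assumes the `/jobs/<jobid>/checkpoints` endpoint returns a `history` list whose entries carry `id`, `status`, `end_to_end_duration` and `state_size` (worth checking against the REST docs for the Flink version in use). The function names, the base URL and the output path are hypothetical.

```python
import csv
import json
import time
import urllib.request


def fetch_checkpoints(base_url, job_id):
    """Fetch checkpoint statistics for one job from the JobManager REST API."""
    url = f"{base_url}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_history(payload):
    """Return (id, status, duration_ms, state_size_bytes) per checkpoint, oldest first.

    The field names are assumptions about the REST response; missing fields become None.
    """
    rows = [
        (
            cp.get("id"),
            cp.get("status"),
            cp.get("end_to_end_duration"),
            cp.get("state_size"),
        )
        for cp in payload.get("history", [])
    ]
    # Sort by checkpoint id; entries without an id sort first.
    return sorted(rows, key=lambda r: -1 if r[0] is None else r[0])


def poll(base_url, job_id, out_path, interval_s=60, rounds=None):
    """Append each newly seen (id, status) pair to a CSV file.

    Keying on (id, status) means a checkpoint that is first seen in progress
    gets a second row once it completes or fails.
    """
    seen = set()
    done = 0
    while rounds is None or done < rounds:
        rows = summarize_history(fetch_checkpoints(base_url, job_id))
        with open(out_path, "a", newline="") as f:
            writer = csv.writer(f)
            for row in rows:
                key = (row[0], row[1])
                if key not in seen:
                    seen.add(key)
                    writer.writerow(row)
        done += 1
        time.sleep(interval_s)
```

Something like `poll("http://localhost:8081", "<job id>", "checkpoints.csv")` would then run alongside the backfill. The resulting file should show whether duration and state size grow steadily before the first failure, which might help tell stale state apart from a plain lack of resources.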