The checkpoint was only acknowledged shortly after it was started.
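For context, the checkpoint timing discussed below is governed by a few settings on CheckpointConfig. A minimal sketch using the DataStream API; the interval, timeout, and tolerance values are illustrative placeholders, not this job's actual configuration:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuningSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 60s with exactly-once semantics.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig checkpointConfig = env.getCheckpointConfig();
            // A backpressured backfill may need a longer timeout before a slow
            // checkpoint is declared failed (the default is 10 minutes).
            checkpointConfig.setCheckpointTimeout(30 * 60 * 1000L);
            // Leave time between attempts so the job can drain in-flight data.
            checkpointConfig.setMinPauseBetweenCheckpoints(60_000L);
            // Don't fail the whole job on the first expired checkpoint.
            checkpointConfig.setTolerableCheckpointFailureNumber(3);
        }
    }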
On Thu, Mar 4, 2021 at 12:38 PM Dan Hill <quietgol...@gmail.com> wrote:

> I dove deeper into it and made a little more progress (by giving it more
> resources).
>
> Here is a screenshot of one bottleneck:
> https://drive.google.com/file/d/1CIatEuIJwmKjBE9__RihVlxSilchtKS1/view
>
> My job isn't making any progress. It's checkpointing and failing. The
> taskmanager text logs are empty during the checkpoint. It's not clear if
> the checkpoint is making any progress.
>
> https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing
>
> I spent some time changing the memory parameters, but it's unclear if I'm
> making forward progress.
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html
>
> On Tue, Mar 2, 2021 at 3:45 PM Dan Hill <quietgol...@gmail.com> wrote:
>
>> Thanks! Yes, I've looked at these. My job is facing backpressure
>> starting at an early join step. I'm unclear whether it's fine for the
>> backfill to just take more time or whether I need more resources.
>>
>> On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:
>>
>>> Hi Dan,
>>>
>>> I think you could see the details of the checkpoints via the checkpoint
>>> UI [1]. Also, if you see in the pending checkpoints that some tasks do
>>> not take a snapshot, you might have a look at whether that task is
>>> backpressuring the previous tasks [2].
>>>
>>> Best,
>>> Yun
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>>>
>>> ------------------------------------------------------------------
>>> Sender: Dan Hill <quietgol...@gmail.com>
>>> Date: 2021/03/02 04:34:56
>>> Recipient: user <user@flink.apache.org>
>>> Subject: Debugging long Flink checkpoint durations
>>>
>>> Hi. Are there good ways to debug long Flink checkpoint durations?
>>>
>>> I'm running a backfill job over ~10 days of data, and then checkpointing
>>> starts failing. Since I only see the last 10 checkpoints in the
>>> jobmanager UI, I don't see when the failures started.
>>>
>>> I looked through the text logs and didn't see much.
>>>
>>> I assume either:
>>> 1) I have something misconfigured that is causing old state to stick
>>> around, or
>>> 2) I don't have enough resources.
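For reference, the two configuration points that come up in this thread, the checkpoint history shown in the web UI and the TaskManager memory setup, are both controlled from flink-conf.yaml. A minimal sketch with placeholder values, not recommendations for this particular job:

    # Keep more than the default 10 checkpoints in the web UI history, so the
    # first failing checkpoint of a long backfill is still visible.
    web.checkpoints.history: 100

    # Coarse-grained TaskManager memory, per the mem_setup_tm page linked above.
    taskmanager.memory.process.size: 4096m

    # Managed memory backs RocksDB state (if that state backend is used);
    # raising the fraction trades heap for state.
    taskmanager.memory.managed.fraction: 0.4

    # Network buffers sit between backpressured tasks.
    taskmanager.memory.network.fraction: 0.1
    taskmanager.memory.network.min: 64mb
    taskmanager.memory.network.max: 1gb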