Re: Debugging long Flink checkpoint durations

2021-03-04 Thread Dan Hill
The checkpoint was only acknowledged shortly after it was started.

On Thu, Mar 4, 2021 at 12:38 PM Dan Hill wrote:
> I dove deeper into it and made a little more progress (by giving more
> resources).
>
> Here is a screenshot of one bottleneck:
> https://drive.google.com/file/d/1CIatEuIJwmKjBE9_

Re: Debugging long Flink checkpoint durations

2021-03-04 Thread Dan Hill
I dove deeper into it and made a little more progress (by giving more resources).

Here is a screenshot of one bottleneck:
https://drive.google.com/file/d/1CIatEuIJwmKjBE9__RihVlxSilchtKS1/view

My job isn't making any progress. It's checkpointing and failing. The taskmaster text logs are empty d

Re: Debugging long Flink checkpoint durations

2021-03-02 Thread Dan Hill
Thanks! Yes, I've looked at these. My job is facing backpressure starting at an early join step. I'm unclear whether the backfill just needs more time or whether I need more resources.

On Tue, Mar 2, 2021 at 12:50 AM Yun Gao wrote:
> Hi Dan,
>
> I think you could see the detail of the checkpoints via
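
One way to narrow down where the backpressure at the join step starts is to look at per-subtask backpressure levels for each vertex. The sketch below is only an illustration: it assumes the response has already been fetched from Flink's REST endpoint `/jobs/<job-id>/vertices/<vertex-id>/backpressure` and parsed into a dict, and that it carries a `subtasks` list whose entries have `subtask` and `backpressure-level` fields. Those field names can differ between Flink versions, and the helper name `backpressured_subtasks` is hypothetical.

```python
from typing import Any, Dict, List, Sequence


def backpressured_subtasks(
    response: Dict[str, Any],
    levels: Sequence[str] = ("high",),
) -> List[int]:
    """Return indices of subtasks whose backpressure level is in `levels`.

    `response` is assumed to be the parsed JSON of a vertex backpressure
    query; the "subtasks" and "backpressure-level" keys are assumptions
    about that payload, not something confirmed in this thread.
    """
    flagged = []
    for subtask in response.get("subtasks", []):
        if subtask.get("backpressure-level") in levels:
            flagged.append(subtask.get("subtask"))
    return flagged


# Hand-written sample payload for illustration only (not real job output).
sample = {
    "status": "ok",
    "subtasks": [
        {"subtask": 0, "backpressure-level": "ok"},
        {"subtask": 1, "backpressure-level": "high"},
        {"subtask": 2, "backpressure-level": "low"},
    ],
}

print(backpressured_subtasks(sample))
print(backpressured_subtasks(sample, levels=("low", "high")))
```

If every subtask of the operators feeding the join reports a high level while the join's own output side does not, the join is the likelier bottleneck, which is where extra parallelism or memory would be applied first.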

Re: Debugging long Flink checkpoint durations

2021-03-02 Thread Yun Gao
Hi Dan,

I think you can see the details of the checkpoints in the checkpoint UI [1]. Also, if the pending checkpoints show that some tasks have not taken their snapshot, you might check whether such a task is backpressuring the upstream tasks [2].

Best,
Yun

[1] https://ci.apache.org/projects/
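
The same check suggested above (finding checkpoints where some tasks have not taken their snapshot) can also be scripted against the checkpoint statistics instead of read off the UI. This is a minimal sketch under stated assumptions: the input is the parsed JSON from Flink's REST endpoint `/jobs/<job-id>/checkpoints`, and each entry in its `history` list has `id`, `status`, `num_subtasks`, `num_acknowledged_subtasks`, and `end_to_end_duration` fields. Those field names may vary by Flink version, and the helper name `incomplete_checkpoints` is hypothetical.

```python
from typing import Any, Dict, List


def incomplete_checkpoints(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List checkpoints in which fewer subtasks acknowledged than exist.

    `stats` is assumed to be the parsed checkpoint statistics for one job;
    the keys read here are assumptions about that payload.
    """
    result = []
    for checkpoint in stats.get("history", []):
        total = checkpoint.get("num_subtasks", 0)
        acked = checkpoint.get("num_acknowledged_subtasks", 0)
        if acked < total:
            result.append(
                {
                    "id": checkpoint.get("id"),
                    "status": checkpoint.get("status"),
                    "missing_acks": total - acked,
                    "duration_ms": checkpoint.get("end_to_end_duration"),
                }
            )
    return result


# Hand-written sample payload for illustration only (not real job output).
sample_stats = {
    "history": [
        {"id": 7, "status": "IN_PROGRESS", "num_subtasks": 4,
         "num_acknowledged_subtasks": 1, "end_to_end_duration": 90000},
        {"id": 6, "status": "COMPLETED", "num_subtasks": 4,
         "num_acknowledged_subtasks": 4, "end_to_end_duration": 1200},
    ]
}

for entry in incomplete_checkpoints(sample_stats):
    print(entry)
```

A checkpoint that stays in progress with missing acknowledgements for a long time points at the tasks that have not yet received the checkpoint barrier, which is typically a symptom of backpressure upstream of them.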