The checkpoint was only acknowledged shortly after it was started.

On Thu, Mar 4, 2021 at 12:38 PM Dan Hill <quietgol...@gmail.com> wrote:

> I dove deeper into it and made a little more progress (by giving more
> resources).
>
> Here is a screenshot of one bottleneck:
> https://drive.google.com/file/d/1CIatEuIJwmKjBE9__RihVlxSilchtKS1/view
>
> My job isn't making any progress.  It's checkpointing and failing.  The
> taskmanager text logs are empty during the checkpoint, so it's not clear
> whether the checkpoint is making any progress.
>
> https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing
>
> I spent some time changing the memory parameters but it's unclear if I'm
> making forward progress.
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html
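>
> For reference, these are roughly the flink-conf.yaml settings I have been
> adjusting per that page (the values below are only illustrative, not my
> actual configuration):
>
>     taskmanager.memory.process.size: 4g        # total memory for the TaskManager process
>     taskmanager.memory.managed.fraction: 0.4   # share reserved for managed memory (e.g. RocksDB)
>     taskmanager.memory.network.fraction: 0.1   # network buffers used for shuffles
>     taskmanager.memory.jvm-overhead.max: 1g    # cap on JVM overhead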
>
>
>
>
> On Tue, Mar 2, 2021 at 3:45 PM Dan Hill <quietgol...@gmail.com> wrote:
>
>> Thanks!  Yes, I've looked at these.  My job is facing backpressure starting
>> at an early join step.  I'm not sure whether the backfill just needs more
>> time or whether I need more resources.
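>>
>> In case it helps to be concrete, this is a minimal sketch of the checkpoint
>> settings I could try for the "more time" side (assuming the Flink 1.11
>> DataStream API; the numbers are only illustrative):
>>
>>     import org.apache.flink.streaming.api.environment.CheckpointConfig;
>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>     env.enableCheckpointing(60_000);                // trigger a checkpoint every 60s
>>     CheckpointConfig cc = env.getCheckpointConfig();
>>     cc.setCheckpointTimeout(30 * 60 * 1000L);       // give slow checkpoints more time before they expire
>>     cc.setMinPauseBetweenCheckpoints(60_000);       // guarantee some progress between checkpoints
>>     cc.setTolerableCheckpointFailureNumber(3);      // tolerate a few failed checkpoints before failing the job
>>     cc.enableUnalignedCheckpoints();                // 1.11+: barriers can overtake in-flight data under backpressure
>>     // ... job definition and env.execute() as before ...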
>>
>> On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <yungao...@aliyun.com> wrote:
>>
>>> Hi Dan,
>>>
>>> I think you could see the details of the checkpoints via the checkpoint
>>> UI [1]. Also, if you see in the pending checkpoints that some tasks have not
>>> taken a snapshot, you might check whether those tasks are backpressuring
>>> the previous tasks [2].
>>>
>>> Best,
>>> Yun
>>>
>>>
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
>>>
>>> ------------------------------------------------------------------
>>> Sender: Dan Hill <quietgol...@gmail.com>
>>> Date: 2021/03/02 04:34:56
>>> Recipient: user <user@flink.apache.org>
>>> Subject: Debugging long Flink checkpoint durations
>>>
>>> Hi.  Are there good ways to debug long Flink checkpoint durations?
>>>
>>> I'm running a backfill job that processes ~10 days of data, and at some
>>> point checkpoints start failing.  Since I only see the last 10 checkpoints
>>> in the jobmanager UI, I can't see when the failures start.
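>>>
>>> One small thing I might try first, so the UI history goes back further
>>> (assuming the web.checkpoints.history option is what controls this), is
>>> raising it in flink-conf.yaml:
>>>
>>>     web.checkpoints.history: 100   # default is 10 recent checkpoints shown in the UI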
>>>
>>> I looked through the text logs and didn't see much.
>>>
>>> I assume one of the following:
>>> 1) I have something misconfigured that is causing old state to stick around
>>> (rough sketch of one possible mitigation below).
>>> 2) I don't have enough resources.
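>>>
>>> For 1), if the join state lives in operators I control (e.g. a
>>> KeyedProcessFunction rather than the built-in join), one mitigation would
>>> be a state TTL.  Rough sketch only; the descriptor name and type are made up:
>>>
>>>     import org.apache.flink.api.common.state.StateTtlConfig;
>>>     import org.apache.flink.api.common.state.ValueStateDescriptor;
>>>     import org.apache.flink.api.common.time.Time;
>>>
>>>     // inside open() of the function that buffers join input
>>>     StateTtlConfig ttl = StateTtlConfig
>>>             .newBuilder(Time.days(1))                                   // keep entries at most one day
>>>             .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)  // refresh TTL whenever the entry is written
>>>             .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
>>>             .cleanupInRocksdbCompactFilter(1000)                        // drop expired entries during RocksDB compaction
>>>             .build();
>>>
>>>     ValueStateDescriptor<String> joinBufferDesc =
>>>             new ValueStateDescriptor<>("left-join-buffer", String.class);  // hypothetical descriptor
>>>     joinBufferDesc.enableTimeToLive(ttl);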
>>>
>>>
>>>
