Hi Xiangyu Su,
Without more detailed information, I can only offer some general
troubleshooting ideas. I hope they are helpful to you.
1. Find out which checkpoint expired. You can see this in the Web UI [1]
or in `jobmanager.log`.
2. Find out which operators had not yet finished the checkpoint when it
expired. You can see this in the checkpoint details in the Web UI [1].
3. Find out which stage of the slow operator takes the time: alignment
duration, sync duration, or async duration [1].
    If the operator spent a long time in alignment, check whether the job
is under backpressure. You can see this in the BackPressure part of the
Web UI. You can enable unaligned checkpoints [2] to greatly reduce
checkpointing times under backpressure.
    If the operator spent a long time in the async phase, check whether
there is a network problem, for example between the TaskManagers and the
checkpoint storage (S3 in your case).
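
To illustrate step 3, here is a minimal Python sketch that picks the
longest stage for each subtask. It is only an illustration: the sample
numbers are made up, and the field names (`status`, `alignment.duration`,
`checkpoint.sync`, `checkpoint.async`) are my assumption, modelled on the
per-subtask checkpoint statistics that Flink reports, so please check them
against what your version actually returns.

```python
# Hypothetical sample data; all durations are in milliseconds.
# The field names are an assumption modelled on Flink's per-subtask
# checkpoint statistics, not a guaranteed schema.
SAMPLE_SUBTASKS = [
    {"index": 0, "status": "completed",
     "alignment": {"duration": 120},
     "checkpoint": {"sync": 15, "async": 900}},
    {"index": 1, "status": "completed",
     "alignment": {"duration": 540000},
     "checkpoint": {"sync": 20, "async": 1100}},
    # A subtask that never acknowledged the checkpoint has no durations.
    {"index": 2, "status": "pending_or_failed"},
]


def dominant_stage(subtask):
    """Return the longest stage name, or None if the subtask never acked."""
    if subtask.get("status") != "completed":
        return None
    durations = {
        "alignment": subtask["alignment"]["duration"],
        "sync": subtask["checkpoint"]["sync"],
        "async": subtask["checkpoint"]["async"],
    }
    # max() over the dict keys, ranked by their duration values.
    return max(durations, key=durations.get)


def summarize(subtasks):
    """Map each subtask index to its dominant stage (None = no ack yet)."""
    return {s["index"]: dominant_stage(s) for s in subtasks}


if __name__ == "__main__":
    for index, stage in summarize(SAMPLE_SUBTASKS).items():
        print(index, stage or "not acknowledged")
```

With the sample data, subtask 1 is dominated by alignment, which would
point to backpressure, while subtask 2 never acknowledged, so it is the one
to look at first in the Web UI.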

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/
[2]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/unaligned_checkpoints/

Best,
JING ZHANG

On Wed, Sep 1, 2021 at 3:52 PM, Xiangyu Su <xian...@smaato.com> wrote:

> Hello Everyone,
> We have been facing checkpoint failures since version 1.9; we are
> currently using version 1.13.2.
>
> We are using the filesystem (S3) state backend with a 10-minute
> checkpoint timeout; a checkpoint usually takes 10-30 seconds.
> But sometimes I have seen the job fail and restart due to a checkpoint
> timeout without any large increase in incoming data, and I have also
> seen the checkpointing progress of some subtasks stuck at e.g. 7% for
> 10 minutes.
> My guess is that the thread doing the checkpointing somehow gets blocked...
>
> Any suggestions or ideas would be helpful, thanks.
>
>
> Best Regards,
> --
> Xiangyu Su
> Java Developer
> xian...@smaato.com
