Hi Xiangyu Su,

Without more detailed information I can only offer some troubleshooting ideas. I hope they help.

1. Find out which checkpoint expired. You can see this in the Web UI [1] or in `jobmanager.log`.
2. Find out which operators had not finished the checkpoint when it expired. You can see this in the checkpoint details in the Web UI [1].
3. Find out which stage of the slow operator takes the time: alignment duration, sync duration, or async duration [1].

If an operator spends a long time in alignment, check whether the job is under backpressure. The BackPressure tab in the Web UI shows this. You can enable unaligned checkpoints [2] to greatly reduce checkpointing times under backpressure.

If an operator spends a long time in the async duration, check whether there is a network problem.
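To make step 3 concrete, here is a minimal sketch in plain Java of the decision described above. It is not part of Flink: the class `CheckpointTriage`, its methods, and the sample numbers are all hypothetical, and you would copy the three durations by hand from the checkpoint details page in the Web UI. The email gives no advice for a slow sync stage, so the sketch only says so.

```java
// Hypothetical helper, not a Flink API: picks the slowest checkpoint stage
// of one subtask and maps it to the advice given in this email.
final class CheckpointTriage {

    enum Stage { ALIGNMENT, SYNC, ASYNC }

    // Returns the stage with the largest duration (milliseconds).
    // Ties are resolved in the order ALIGNMENT, SYNC, ASYNC.
    static Stage slowestStage(long alignmentMs, long syncMs, long asyncMs) {
        if (alignmentMs >= syncMs && alignmentMs >= asyncMs) {
            return Stage.ALIGNMENT;
        }
        return syncMs >= asyncMs ? Stage.SYNC : Stage.ASYNC;
    }

    // Maps the slowest stage to the next troubleshooting step from the email.
    static String nextStep(Stage stage) {
        switch (stage) {
            case ALIGNMENT:
                return "Check the BackPressure tab; consider unaligned checkpoints.";
            case ASYNC:
                return "Check for network problems.";
            default:
                // The email does not cover a slow sync stage.
                return "Not covered by the advice above.";
        }
    }

    public static void main(String[] args) {
        // Made-up durations for illustration only.
        Stage stage = slowestStage(540_000L, 200L, 3_000L);
        System.out.println(stage + ": " + nextStep(stage));
    }
}
```

Picking the single largest duration is a simplification. If two stages are both slow, it is worth following up on both.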
[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/unaligned_checkpoints/

Best,
JING ZHANG

Xiangyu Su <xian...@smaato.com> wrote on Wed, Sep 1, 2021 at 3:52 PM:

> Hello Everyone,
>
> We have been facing checkpointing failures since version 1.9; we are
> currently using version 1.13.2.
>
> We use the filesystem (S3) state backend with a 10-minute checkpoint
> timeout. A checkpoint usually takes 10-30 seconds.
> But sometimes I have seen the job fail and restart due to a checkpoint
> timeout without any large increase in incoming data, and I have also seen
> the checkpointing progress of some subtasks get stuck at e.g. 7% for
> 10 minutes.
> My guess is that the thread doing the checkpointing somehow gets blocked...
>
> Any suggestions or ideas would be helpful, thanks.
>
> Best Regards,
> --
> Xiangyu Su