Hi Xiangyu Su,
Without more detailed information, I can only offer some troubleshooting
ideas. I hope they are helpful to you.
1. Find out which checkpoint expired. You can find that in the Web UI [1] or
in `jobmanager.log`.
2. Find out which operators had not finished the checkpoint yet when the checkp
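
For step 1, a small script can pull the expired checkpoint IDs out of `jobmanager.log`. This is only a sketch: the log line shape ("Checkpoint <id> of job <job id> expired before completing") is an assumption about what the JobManager writes, and the function name `expired_checkpoints` is hypothetical, so adjust the pattern to match your own log.

```python
import re

# Assumed shape of the JobManager log line for an expired checkpoint.
# Adjust this pattern if your jobmanager.log words it differently.
EXPIRED = re.compile(r"Checkpoint (\d+) of job (\S+) expired before completing")


def expired_checkpoints(lines):
    """Return (checkpoint_id, job_id) pairs for every line matching EXPIRED."""
    found = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found
```

To use it, open the log file and pass the file object (or any iterable of lines) to `expired_checkpoints`; the returned IDs can then be looked up in the Web UI's checkpoint history.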
Hi Xiangyu,
Can you provide us with more information about your job, which state
backend you are using and how you've configured the checkpointing? Can you
also provide some information about the problematic checkpoints (e.g.
alignment time, async/sync duration) that you find on the checkpoint
det
Hello Everyone,
Hello Till,
We have been facing checkpointing failures since version 1.9; we are
currently using version 1.13.2.
We are using the filesystem (S3) state backend with a 10-minute checkpoint
timeout; a checkpoint usually takes 10-30 seconds.
But sometimes I have seen Job failed and restarted d
Hello Everyone,
We have been facing checkpointing failures since version 1.9; we are
currently using version 1.13.2.
We are using the filesystem (S3) state backend with a 10-minute checkpoint
timeout; a checkpoint usually takes 10-30 seconds.
But sometimes I have seen Job failed and restarted due to checkp