Re: Checkpointing failure, subtasks get stuck

2021-09-02 Thread JING ZHANG
Hi Xiangyu Su, Because detailed information is lacking, I can only offer some troubleshooting ideas. I hope they are helpful to you. 1. Find out which checkpoint expired. You can find that in the Web UI [1] or in `jobmanager.log`. 2. Find out which operators had not finished the checkpoint yet when the checkp
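For step 1 above, a minimal sketch of scanning `jobmanager.log` for expired checkpoints might look like the following. The message pattern ("Checkpoint N of job ID expired before completing") is an assumption about the log format and should be checked against the actual log; the function name `expired_checkpoints` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed shape of the JobManager message for an expired checkpoint.
# Adjust the pattern if your jobmanager.log words it differently.
EXPIRED = re.compile(r"Checkpoint (\d+) of job (\S+) expired before completing")


def expired_checkpoints(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (checkpoint_id, job_id) pairs for every matching log line."""
    found = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found
```

It could be applied to a log file with `expired_checkpoints(open("jobmanager.log"))`; the resulting checkpoint IDs can then be looked up in the Web UI to see which subtasks had not acknowledged.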

Re: Checkpointing failure, subtasks get stuck

2021-09-02 Thread Till Rohrmann
Hi Xiangyu, Can you provide us with more information about your job, which state backend you are using and how you've configured the checkpointing? Can you also provide some information about the problematic checkpoints (e.g. alignment time, async/sync duration) that you find on the checkpoint det

Checkpointing failure, subtasks get stuck

2021-09-02 Thread Xiangyu Su
Hello Everyone, Hello Till, We have been facing checkpointing failures since version 1.9; we are currently using version 1.13.2. We use the filesystem (S3) state backend with a 10-minute checkpoint timeout; a checkpoint usually takes 10-30 seconds. But sometimes I have seen the job fail and restart d

Checkpointing failure, subtasks get stuck

2021-09-01 Thread Xiangyu Su
Hello Everyone, We have been facing checkpointing failures since version 1.9; we are currently using version 1.13.2. We use the filesystem (S3) state backend with a 10-minute checkpoint timeout; a checkpoint usually takes 10-30 seconds. But sometimes I have seen the job fail and restart due to checkp