Hi all,

I have a proposal that aims to enhance a Flink application's resilience against unexpected failures of checkpoint storage such as S3 or HDFS.
*[Background]*
When using self-managed, S3-compatible object storage, we faced asynchronous checkpoint failures lasting for an extended period (more than 30 minutes), leading to multiple job restarts and causing lag in our streaming application.

*[Current Behavior]*
Currently, when the number of checkpoint failures exceeds the configured tolerable limit (execution.checkpointing.tolerable-failed-checkpoints), Flink either restarts or fails the job, depending on how it is configured. In my opinion this does not handle scenarios where the checkpoint storage itself is unreliable or experiencing downtime.

*[Proposed Feature]*
I propose a config option that triggers a graceful job stop with a savepoint when the tolerable checkpoint failure limit is reached. Instead of restarting or failing the job once the limit is exceeded, Flink would simply trigger stopWithSavepoint when this new option is set to true. This could offer the following benefits:

- Indication of checkpoint storage state: exceeding the tolerable number of checkpoint failures can indicate unstable checkpoint storage.
- Automated fallback strategy: combined with a monitoring cron job, this feature could act as an automated fallback for handling unstable checkpoint storage. The job would stop safely, take a savepoint, and could then be restarted automatically with a different checkpoint storage configured, e.g. switching from S3 to HDFS.

For example, say the checkpoint path is configured to S3 and the savepoint path to HDFS. With the new option set to true, the job stops with a savepoint once the tolerable checkpoint failure limit is exceeded, and we can then restart the job from that savepoint with the checkpoint storage pointed at HDFS (see the sketch at the end of this mail).

Looking forward to hearing the community's thoughts on this proposal. I would also like to ask how the community is handling long-lasting unstable checkpoint storage issues.

Thanks in advance.

Best,
dongwoo
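P.S. To make the fallback idea a bit more concrete, here is a minimal sketch of how the job side could look. It only uses the existing CheckpointConfig API; the commented-out setter name for the proposed behavior is purely hypothetical and just illustrates where the new option would plug in:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFallbackSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s to the (possibly unstable) S3 storage.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // Current behavior: after 5 failed checkpoints the job restarts or fails
        // according to the configured restart strategy.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);

        // Proposed behavior (hypothetical option name): instead of restarting or
        // failing, trigger stopWithSavepoint once the limit is exceeded.
        // env.getCheckpointConfig().setStopWithSavepointOnTolerableFailure(true);

        // Placeholder pipeline standing in for the real job graph.
        env.fromSequence(1, Long.MAX_VALUE).print();

        env.execute("checkpoint-fallback-sketch");
    }
}
```

An external monitoring job could then detect the stopped job and resubmit it from the produced savepoint (e.g. flink run -s <savepointPath> ...) with the checkpoint directory (state.checkpoints.dir) pointing at HDFS instead of S3.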