Hi all,

I have a proposal that aims to enhance a Flink application's resilience against unexpected failures of checkpoint storage such as S3 or HDFS.
*[Background]*
When using self-managed, S3-compatible object storage, we faced asynchronous checkpoint failures lasting for an extended period (more than 30 minutes), leading to multiple job restarts and causing lag in our streaming application.

*[Current Behavior]*
Currently, when the number of checkpoint failures exceeds the configured tolerable limit (execution.checkpointing.tolerable-failed-checkpoints), Flink either restarts or fails the job, depending on how it is configured. In my opinion this does not handle scenarios where the checkpoint storage itself is unreliable or experiencing downtime.

*[Proposed Feature]*
I propose a config option that triggers a graceful job stop with a savepoint when the tolerable checkpoint failure limit is reached. Instead of restarting or failing the job once the limit is exceeded, Flink would simply trigger stopWithSavepoint when this new option is set to true. This could offer the following benefits:

- Indication of checkpoint storage state: exceeding the tolerable number of checkpoint failures can indicate unstable checkpoint storage.
- Automated fallback strategy: combined with a monitoring cron job, this feature could act as an automated fallback for handling unstable checkpoint storage. The job would stop safely, take a savepoint, and could then be restarted automatically with a different checkpoint storage configured, e.g. switching from S3 to HDFS.

For example, say the checkpoint path is configured to S3 and the savepoint path to HDFS. With the new option set to true, the job stops with a savepoint once the tolerable checkpoint failure limit is exceeded, and we can then restart the job from that savepoint with the checkpoint storage pointed at HDFS (see the sketch at the end of this mail).

Looking forward to hearing the community's thoughts on this proposal. I would also like to ask how the community is handling long-lasting unstable checkpoint storage issues.

Thanks in advance.

Best,
dongwoo
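P.S. To make the fallback idea a bit more concrete, here is a minimal sketch of how the job side could look. It only uses the existing CheckpointConfig API; the commented-out setter name for the proposed behavior is purely hypothetical and just illustrates where the new option would plug in:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFallbackSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s to the (possibly unstable) S3 storage.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // Current behavior: after 5 failed checkpoints the job restarts or fails
        // according to the configured restart strategy.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);

        // Proposed behavior (hypothetical option name): instead of restarting or
        // failing, trigger stopWithSavepoint once the limit is exceeded.
        // env.getCheckpointConfig().setStopWithSavepointOnTolerableFailure(true);

        // Placeholder pipeline standing in for the real job graph.
        env.fromSequence(1, Long.MAX_VALUE).print();

        env.execute("checkpoint-fallback-sketch");
    }
}
```

An external monitoring job could then detect the stopped job and resubmit it from the produced savepoint (e.g. flink run -s <savepointPath> ...) with the checkpoint directory (state.checkpoints.dir) pointing at HDFS instead of S3.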