[ https://issues.apache.org/jira/browse/FLINK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813322#comment-16813322 ]
vinoyang commented on FLINK-12144: ---------------------------------- [~gyfora] The proposal sounds like the discussion under FLINK-11159, right? > Option to prefer checkpoints on recovery > ---------------------------------------- > > Key: FLINK-12144 > URL: https://issues.apache.org/jira/browse/FLINK-12144 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing > Reporter: Gyula Fora > Priority: Trivial > > When a streaming job fails the getLatestCheckpoint() of the CheckpointStore > is used to determine which checkpoint or savepoint is going to be used for > recovery. > This behaviour is perfectly fine for jobs with relatively small states or > where there are no strong SLAs but it some cases it can be problematic. > For jobs with a very large state size, the difference between recovery times > from savepoints and checkpoints can be substantial to the point where it > might break a use-case. So we would like to avoid ever recovering from a > savepoint if a not too old checkpoint is also readily available. > This cannot be avoided right now if a job fails after we took a savepoint > maybe for backup purposes (maybe it is scheduled multiple times a day). > I suggest we add a configuration option to allow the job to fall back to an > earlier checkpoint (within maybe a certain age limit) even if there is a > newer savepoint available to avoid lengthy downtimes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)