[ https://issues.apache.org/jira/browse/FLINK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Zagrebin updated FLINK-9661:
-----------------------------------
    Summary: TTL state should support to do time shift after restoring from checkpoint (savepoint).  (was: TTL state should support to do time shift after restoring from checkpoint( savepoint).)

> TTL state should support to do time shift after restoring from checkpoint (savepoint).
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-9661
>                 URL: https://issues.apache.org/jira/browse/FLINK-9661
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.6.0
>            Reporter: Sihua Zhou
>            Priority: Major
>
> The initial version of TTL state stores the expiration timestamp alongside
> each state record and, when the state is accessed, checks the condition
> {{expired_timestamp <= current_time}}. If the condition is true, the record
> is expired; otherwise it is still alive. This works fine in most cases, but
> in some cases we need to shift the time, otherwise using processing time may
> produce unexpected results. Two such cases are roughly described below.
> - When restoring the job from a savepoint
> For example, the user sets the TTL of the state to 2h. If they trigger a
> savepoint and restore the job from it 2h or more later (perhaps something
> prevented them from restoring the job sooner), then all of the restored
> job's previous state data is expired.
> - When the job takes a long time to recover from a failure
> For example, many jobs run on a YARN session cluster that is configured to
> store checkpoint data on a DFS. Unfortunately, the DFS hits a strange
> problem that makes the jobs on the cluster loop in
> recovery-fail-recovery-fail... The devs spend some time fixing the DFS issue
> and the jobs start working properly again, but if
> {{system down time >= TTL}}, the jobs' previous state data will be expired.
> To avoid the problems above, we need to shift the time after the job
> recovers from a checkpoint or savepoint. A possible approach is outlined in
> [6186|https://github.com/apache/flink/pull/6186].
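
To make the savepoint case concrete, here is a minimal Java sketch of the expiry check quoted in the issue and one possible time shift. It is an illustration only: the class and method names (TtlTimeShiftSketch, isExpired, shiftOnRestore) are hypothetical and are not Flink APIs. The sketch assumes the shift equals the gap between taking the snapshot and restoring from it, which is not necessarily the approach taken in the linked pull request.

```java
// Hypothetical sketch: these names are illustrative and are NOT Flink APIs.
public class TtlTimeShiftSketch {

    /** Expiry check quoted in the issue: expired_timestamp <= current_time. */
    static boolean isExpired(long expiredTimestamp, long currentTime) {
        return expiredTimestamp <= currentTime;
    }

    /**
     * Assumed shift: move the expiration timestamp forward by the downtime,
     * i.e. the gap between taking the snapshot and restoring from it.
     */
    static long shiftOnRestore(long expiredTimestamp, long snapshotTime, long restoreTime) {
        long downtime = Math.max(0L, restoreTime - snapshotTime);
        return expiredTimestamp + downtime;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;     // one hour in milliseconds
        long ttl = 2 * hour;               // TTL of 2h, as in the issue's example
        long lastUpdate = 0L;              // record written at t = 0
        long expiredTimestamp = lastUpdate + ttl;

        long snapshotTime = 1 * hour;      // savepoint taken at t = 1h
        long restoreTime = 4 * hour;       // job restored at t = 4h (3h of downtime)

        // Without a shift the record is already expired at restore time.
        System.out.println(isExpired(expiredTimestamp, restoreTime)); // prints "true"

        // With the shift the record keeps the 1h of lifetime it had at the savepoint.
        long shifted = shiftOnRestore(expiredTimestamp, snapshotTime, restoreTime);
        System.out.println(isExpired(shifted, restoreTime));          // prints "false"
    }
}
```

Under this assumption a record keeps the remaining lifetime it had when the snapshot was taken, so downtime no longer counts against the TTL.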