Ufuk Celebi created FLINK-5763:
----------------------------------

             Summary: Make savepoints self-contained and relocatable
                 Key: FLINK-5763
                 URL: https://issues.apache.org/jira/browse/FLINK-5763
             Project: Flink
          Issue Type: Improvement
          Components: State Backends, Checkpointing
            Reporter: Ufuk Celebi
            Assignee: Ufuk Celebi


After a user has triggered a savepoint, a single savepoint file will be 
returned as a handle to the savepoint. A savepoint to {{<target>}} creates a 
savepoint file like {{<target>/savepoint-<randomSuffix>}}.

This file contains the metadata of the corresponding checkpoint, but not the 
actual program state. While this works well for short-term management 
(pausing and resuming a job), it makes savepoints hard to manage over longer 
periods of time.

h4. Problems

h5. Scattered Checkpoint Files

For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
results in the savepoint referencing files from the checkpoint directory 
(usually different from {{<target>}}). For users, it is virtually impossible 
to tell which checkpoint files belong to a savepoint and which are merely 
left over. This can easily lead to accidentally invalidating a savepoint by 
deleting checkpoint files.

h5. Savepoints Not Relocatable

Even if a user is able to figure out which checkpoint files belong to a 
savepoint, moving these files will invalidate the savepoint as well, because 
the metadata file references absolute file paths.

h5. Forced to Use CLI for Disposal

Because of the scattered files, the user is in practice forced to use Flink’s 
CLI to dispose of a savepoint. Disposal should instead be possible within the 
user’s environment via a plain file system delete operation.

h4. Proposal

To solve the described problems, savepoints should contain all their state, 
both metadata and program state, inside a single directory. Furthermore, the 
metadata must hold only relative references to the checkpoint files. This 
makes it obvious which files make up the state of a savepoint, and makes it 
possible to relocate a savepoint by moving its directory.
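
The effect of relative references can be sketched outside of Flink. The 
following is a minimal illustration only: the JSON metadata format, the 
function names and the {{data-<index>}} file names are assumptions made for 
the example and do not reflect Flink's actual metadata serialization.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from pathlib import Path


def write_savepoint(directory: Path, state_chunks: list[bytes]) -> None:
    """Write data files plus a _metadata file that lists them by relative name."""
    directory.mkdir(parents=True)
    names = []
    for index, chunk in enumerate(state_chunks):
        name = f"data-{index}"  # stand-in for data-<randomSuffix>
        (directory / name).write_bytes(chunk)
        names.append(name)
    # Hypothetical metadata format: a JSON list of relative file names.
    (directory / "_metadata").write_text(json.dumps(names))


def read_savepoint(directory: Path) -> list[bytes]:
    """Resolve every reference against the directory that holds _metadata."""
    names = json.loads((directory / "_metadata").read_text())
    return [(directory / name).read_bytes() for name in names]


def demo_relocation() -> list[bytes]:
    """Write a savepoint, move its directory, and read it from the new place."""
    root = Path(tempfile.mkdtemp())
    try:
        original = root / "target" / "savepoint-job-suffix"
        write_savepoint(original, [b"operator-a", b"operator-b"])
        relocated = root / "archive" / "savepoint-job-suffix"
        relocated.parent.mkdir(parents=True)
        shutil.move(str(original), str(relocated))
        return read_savepoint(relocated)
    finally:
        shutil.rmtree(root)
```

Because every reference is resolved against the directory containing 
{{_metadata}}, the read succeeds after the move. With absolute paths in the 
metadata, the same read would point at the old, now empty location.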

h5. Desired File Layout

Triggering a savepoint to {{<target>}} creates a directory as follows:

{code}
<target>/savepoint-<jobId>-<randomSuffix>
  +-- _metadata
  +-- data-<randomSuffix> [1 or more]
{code}

We include the JobID in the savepoint directory name in order to give some 
hints about which job a savepoint belongs to.
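
With this layout, a user-side script could check whether a directory looks 
like a complete savepoint. The sketch below is an assumption about how such a 
check might look; {{check_layout}} is a hypothetical helper, not part of 
Flink, and it only inspects file names, not file contents.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

METADATA_FILE = "_metadata"
DATA_PREFIX = "data-"


def check_layout(savepoint_dir: Path) -> list[str]:
    """Return a list of layout problems; an empty list means the layout matches."""
    if not savepoint_dir.is_dir():
        return [f"{savepoint_dir} is not a directory"]
    entries = sorted(p.name for p in savepoint_dir.iterdir())
    problems = []
    if METADATA_FILE not in entries:
        problems.append("missing _metadata")
    if not any(name.startswith(DATA_PREFIX) for name in entries):
        problems.append("no data-* files")
    return problems


def demo_layout() -> tuple[list[str], list[str]]:
    """Check an empty directory, then the same directory once populated."""
    with tempfile.TemporaryDirectory() as tmp:
        savepoint = Path(tmp) / "savepoint-job-suffix"
        savepoint.mkdir()
        before = check_layout(savepoint)
        (savepoint / METADATA_FILE).write_bytes(b"")
        (savepoint / "data-1").write_bytes(b"state")
        after = check_layout(savepoint)
        return before, after
```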

h5. CLI

- Trigger: When triggering a savepoint to {{<target>}}, the savepoint 
directory will be returned as the handle to the savepoint.
- Restore: Users can restore by pointing to the directory or the 
{{_metadata}} file. The data files are required to be in the same directory 
as the {{_metadata}} file.
- Dispose: The disposal command should be deprecated and eventually removed. 
While deprecated, disposal can happen by specifying the directory or the 
{{_metadata}} file (same as restore).
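
The handle rules for restore and dispose can be illustrated as follows. This 
is a sketch under assumptions, not the CLI implementation: 
{{resolve_savepoint_dir}} and {{dispose}} are hypothetical names, and 
disposal is modelled as the plain recursive delete that the self-contained 
layout is meant to enable.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path

METADATA_FILE = "_metadata"


def resolve_savepoint_dir(handle: Path) -> Path:
    """Accept either the savepoint directory or its _metadata file."""
    if handle.is_dir():
        directory = handle
    elif handle.is_file() and handle.name == METADATA_FILE:
        directory = handle.parent
    else:
        raise ValueError(f"not a savepoint directory or _metadata file: {handle}")
    if not (directory / METADATA_FILE).is_file():
        raise ValueError(f"no _metadata file in {directory}")
    return directory


def dispose(handle: Path) -> None:
    """Delete the whole savepoint directory; no other location holds state."""
    shutil.rmtree(resolve_savepoint_dir(handle))


def demo_handles() -> tuple[bool, bool]:
    """Resolve both handle forms, then dispose via the _metadata handle."""
    with tempfile.TemporaryDirectory() as tmp:
        savepoint = Path(tmp) / "savepoint-job-suffix"
        savepoint.mkdir()
        (savepoint / METADATA_FILE).write_bytes(b"")
        (savepoint / "data-1").write_bytes(b"state")
        same = (resolve_savepoint_dir(savepoint)
                == resolve_savepoint_dir(savepoint / METADATA_FILE))
        dispose(savepoint / METADATA_FILE)
        return same, savepoint.exists()
```

Both handle forms resolve to the same directory, so restore and dispose can 
share one code path, and removing that directory removes the entire savepoint.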




