[ https://issues.apache.org/jira/browse/FLINK-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907156#comment-15907156 ]
ASF GitHub Bot commented on FLINK-5823: --------------------------------------- GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/3522 [FLINK-5823] [checkpoints] State Backends also handle Checkpoint Metadata and introduce Global Cleanup Hooks ## Core Changes ### Store Metadata in State Backend The state backend is now responsible for storing the checkpoint metadata. There is no implicit assumption that the checkpoint metadata is stored in a file systems any more. - All checkpoint directory / savepoint directory specific config settings are now part of the state backends. The Checkpoint Coordinator simply calls the relevant methods on the state backends to store metadata. - Similar as the `CheckpointStreamFactory` for storing checkpoint state, there is now a `CheckpointMetadataStreamFactory` for the metadata. - State backends are not required to be able to persist metadata - only ifor HA setups and when externalized checkpoints are requested. - All checkpoints with persisted metadata are addressable via a "pointer", which is state-backend specific. For File-system based statebackends (all statebackends in Flink currently), this pointer is the file path. ### Global cleanup hooks State backends can implement an extended interface to create global cleanup hooks. For example for a file system, a global cleanup hook simply recursively deletes the checkpoint directory, which is for most file systems much faster than issuing a delete call per file. The `MemoryStateBackend` and `FsStateBackend` use fast cleanup hooks, the `RocksDBStateBackend` should get then in a followup (see below). ### Application-defined State Backends pick up additional values from the configuration We need to keep supporting the scenario of setting a state backends in the user program, but configuring parameters like checkpoint directory in the cluster config. To support that, state backends may implement an additional interface which lets them pick up configuration values from the cluster configuration. ## Altered user-facing Behavior - Externalized checkpoints store all files (state and metadata) strictly in the same directory now. (Savepoints were contained in a single directory already before) - Both savepoint commands and configuration parameters now require qualified URIs as well (i.e., `file:///path/do/savepoint`, whereas before the configs and command line also excepted `/path/do/savepoint`. Because not having qualified URIs is error-prone anyways (auto fallback to local file system) I am actually in favor of doing this change. ## Tests This adds a lot of tests, which can due to the changed design be done completely on the state backends, without instantiating a CheckpointCoordinator. - Checkpoint / Savepoint delete with global hook works as expected - Interaction of old cleanup logic and optional global cleanup hook - State backends are properly loaded from configuration - Application-configured State backends properly pick up additional configuration values ## Followups - Implement the global hooks for the RocksDB state backend. Because the RocksDB state backend internally delegates to a "storage backend", this need to few extra tricks and was not done in this already large pull request. - The HA checkpoint store needs not store checkpoint metadata itself any more, it can simply store the pointer. - This abstraction allows state backends to over periodic cleanup hooks that can search for lost state (files that are not referenced any more), which can happen when TaskManager / JobManager processes fail during finalizing a checkpoint or when handing over state between JobManager / TaskManager. You can merge this pull request into a Git repository by running: $ git pull https://github.com/StephanEwen/incubator-flink filestatebackend Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3522.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3522 ---- commit 2bd817286f89693af4bb98504091afc1f45749ad Author: Stephan Ewen <se...@apache.org> Date: 2017-03-01T22:00:55Z [FLINK-5823] [checkpoints] State Backends also handle Checkpoint Metadata ---- > Store Checkpoint Root Metadata in StateBackend (not in HA custom store) > ----------------------------------------------------------------------- > > Key: FLINK-5823 > URL: https://issues.apache.org/jira/browse/FLINK-5823 > Project: Flink > Issue Type: Sub-task > Components: State Backends, Checkpointing > Reporter: Stephan Ewen > Assignee: Stephan Ewen > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)