[jira] [Commented] (FLINK-12751) Create file based HA support

Teng Fei Liao (Jira) Thu, 17 Sep 2020 13:01:27 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197929#comment-17197929
 ]


Teng Fei Liao commented on FLINK-12751:
---------------------------------------

If you replace the deployment with a stateful set for the job manager, 
kubernetes will guarantee you exactly 1 job manager will be running. An 
additional benefit you get here is that the stateful maintains the network 
identity for the job manager. We're actually running this variant in production.

> Create file based HA support
> ----------------------------
>
>                 Key: FLINK-12751
>                 URL: https://issues.apache.org/jira/browse/FLINK-12751
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.8.0, 1.9.0, 2.0.0
>         Environment: Flink on k8 and Mini cluster
>            Reporter: Boris Lublinsky
>            Priority: Major
>              Labels: features, pull-request-available
>   Original Estimate: 168h
>          Time Spent: 10m
>  Remaining Estimate: 167h 50m
>
> In the current Flink implementation, HA support can be implemented either 
> using Zookeeper or Custom Factory class.
> Add HA implementation based on PVC. The idea behind this implementation
> is as follows:
> * Because implementation assumes a single instance of Job manager (Job 
> manager selection and restarts are done by K8 Deployment of 1)
> URL management is done using StandaloneHaServices implementation (in the case 
> of cluster) and EmbeddedHaServices implementation (in the case of mini 
> cluster)
> * For management of the submitted Job Graphs, checkpoint counter and 
> completed checkpoint an implementation is leveraging the following file 
> system layout
> {code}
>  ha -----> root of the HA data
>  checkpointcounter -----> checkpoint counter folder
>  <job ID> -----> job id folder
>  <counter file> -----> counter file
>  <another job ID> -----> another job id folder
>  ...........
>  completedCheckpoint -----> completed checkpoint folder
>  <job ID> -----> job id folder
>  <checkpoint file> -----> checkpoint file
>  <another checkpoint file> -----> checkpoint file
>  ...........
>  <another job ID> -----> another job id folder
>  ...........
>  submittedJobGraph -----> submitted graph folder
>  <job ID> -----> job id folder
>  <graph file> -----> graph file
>  <another job ID> -----> another job id folder
>  ...........
> {code}
> An implementation should overwrites 2 of the Flink files:
> * HighAvailabilityServicesUtils - added `FILESYSTEM` option for picking HA 
> service
> * HighAvailabilityMode - added `FILESYSTEM` to available HA options.
> The actual implementation adds the following classes:
> * `FileSystemHAServices` - an implementation of a `HighAvailabilityServices` 
> for file system
> * `FileSystemUtils` - support class for creation of runtime components.
> * `FileSystemStorageHelper` - file system operations implementation for 
> filesystem based HA
> * `FileSystemCheckpointRecoveryFactory` - an implementation of a 
> `CheckpointRecoveryFactory`for file system
> * `FileSystemCheckpointIDCounter` - an implementation of a 
> `CheckpointIDCounter` for file system
> * `FileSystemCompletedCheckpointStore` - an implementation of a 
> `CompletedCheckpointStore` for file system
> * `FileSystemSubmittedJobGraphStore` - an implementation of a 
> `SubmittedJobGraphStore` for file system



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-12751) Create file based HA support

Reply via email to