[ https://issues.apache.org/jira/browse/FLINK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197929#comment-17197929 ]
Teng Fei Liao commented on FLINK-12751: --------------------------------------- If you replace the deployment with a stateful set for the job manager, kubernetes will guarantee you exactly 1 job manager will be running. An additional benefit you get here is that the stateful maintains the network identity for the job manager. We're actually running this variant in production. > Create file based HA support > ---------------------------- > > Key: FLINK-12751 > URL: https://issues.apache.org/jira/browse/FLINK-12751 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / Coordination > Affects Versions: 1.8.0, 1.9.0, 2.0.0 > Environment: Flink on k8 and Mini cluster > Reporter: Boris Lublinsky > Priority: Major > Labels: features, pull-request-available > Original Estimate: 168h > Time Spent: 10m > Remaining Estimate: 167h 50m > > In the current Flink implementation, HA support can be implemented either > using Zookeeper or Custom Factory class. > Add HA implementation based on PVC. The idea behind this implementation > is as follows: > * Because implementation assumes a single instance of Job manager (Job > manager selection and restarts are done by K8 Deployment of 1) > URL management is done using StandaloneHaServices implementation (in the case > of cluster) and EmbeddedHaServices implementation (in the case of mini > cluster) > * For management of the submitted Job Graphs, checkpoint counter and > completed checkpoint an implementation is leveraging the following file > system layout > {code} > ha -----> root of the HA data > checkpointcounter -----> checkpoint counter folder > <job ID> -----> job id folder > <counter file> -----> counter file > <another job ID> -----> another job id folder > ........... > completedCheckpoint -----> completed checkpoint folder > <job ID> -----> job id folder > <checkpoint file> -----> checkpoint file > <another checkpoint file> -----> checkpoint file > ........... > <another job ID> -----> another job id folder > ........... > submittedJobGraph -----> submitted graph folder > <job ID> -----> job id folder > <graph file> -----> graph file > <another job ID> -----> another job id folder > ........... > {code} > An implementation should overwrites 2 of the Flink files: > * HighAvailabilityServicesUtils - added `FILESYSTEM` option for picking HA > service > * HighAvailabilityMode - added `FILESYSTEM` to available HA options. > The actual implementation adds the following classes: > * `FileSystemHAServices` - an implementation of a `HighAvailabilityServices` > for file system > * `FileSystemUtils` - support class for creation of runtime components. > * `FileSystemStorageHelper` - file system operations implementation for > filesystem based HA > * `FileSystemCheckpointRecoveryFactory` - an implementation of a > `CheckpointRecoveryFactory`for file system > * `FileSystemCheckpointIDCounter` - an implementation of a > `CheckpointIDCounter` for file system > * `FileSystemCompletedCheckpointStore` - an implementation of a > `CompletedCheckpointStore` for file system > * `FileSystemSubmittedJobGraphStore` - an implementation of a > `SubmittedJobGraphStore` for file system -- This message was sent by Atlassian Jira (v8.3.4#803005)