[ https://issues.apache.org/jira/browse/FLINK-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380212#comment-16380212 ]
ASF GitHub Bot commented on FLINK-7805:
---------------------------------------

GitHub user GJL opened a pull request:

https://github.com/apache/flink/pull/5597

[FLINK-7805][flip6] Add HA capabilities to YarnResourceManager

## What is the purpose of the change

*Recover previously running containers after a restart of the ApplicationMaster. This is a port of a feature that was already implemented prior to FLIP-6.*

cc: @tillrohrmann

## Brief change log

- *Extract `RegisterApplicationMasterResponseReflector` class into separate file.*
- *Use `RegisterApplicationMasterResponseReflector` from within `YarnResourceManager`*

## Verifying this change

This change added tests and can be verified as follows:

- *Added unit tests for `RegisterApplicationMasterResponseReflector`*
- *Manually deployed a cluster on YARN with HA enabled. Submitted a job, and killed the master several times. Verified that the right log messages were generated ("Recovered X containers from previous attempts"), and that the taskmanager resource ids remained the same.*

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
- The S3 file system connector: (yes / **no** / don't know)

## Documentation

- Does this pull request introduce a new feature? (yes / **no**)
- If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GJL/flink FLINK-7805-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5597

----
commit 012f656705466cea4ba2e037f7e52f46c0a1bf9f
Author: gyao <gary@...>
Date: 2018-02-27T15:58:53Z

    [FLINK-8787][flip6] Do not copy flinkConfiguration in AbstractYarnClusterDescriptor

commit bf8211112c16a16bd29aba754c1c34f4de960af4
Author: gyao <gary@...>
Date: 2018-02-28T12:04:19Z

    [hotfix] Add missing space to log message in ZooKeeperLeaderElectionService

commit a435c6503dcd8e65e34b807d5dc1c7045911e788
Author: gyao <gary@...>
Date: 2018-02-28T12:06:00Z

    [hotfix][Javadoc] Fix typo in YARN Utils: teh -> the

commit 3e4c28c38cc04aa29bdb8928fa26daf3c4ab1e69
Author: gyao <gary@...>
Date: 2018-02-28T12:07:04Z

    [hotfix][Javadoc] Fix typo in YarnTestBase: teh -> the

commit 6b1efdb290c605d93f17fc0aadfb07205eaf60fd
Author: gyao <gary@...>
Date: 2018-02-28T12:08:25Z

    [hotfix][tests] Fix wrong assertEquals in YARNSessionCapacitySchedulerITCase

    Test swapped actual and expected arguments.
    Remove catching Throwable in test; instead propagate all exceptions.

commit 615c19acfc87bad021b3dce02b6dc5fac2aac784
Author: gyao <gary@...>
Date: 2018-02-28T12:20:23Z

    [FLINK-7805][flip6] Recover YARN containers after AM restart.

    Recover previously running containers after a restart of the ApplicationMaster. This is a port of a feature that was already implemented prior to FLIP-6.

    Extract RegisterApplicationMasterResponseReflector class into separate file.

----

> Add HA capabilities to YarnResourceManager
> ------------------------------------------
>
> Key: FLINK-7805
> URL: https://issues.apache.org/jira/browse/FLINK-7805
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Coordination, YARN
> Affects Versions: 1.4.0
> Reporter: Till Rohrmann
> Assignee: Gary Yao
> Priority: Major
> Labels: flip-6
>
> The new {{YarnResourceManager}} implementation does not retrieve allocated
> containers from previous attempts in HA mode like the old
> {{YarnFlinkResourceManager}} did. We should add this functionality in order
> to properly support long running Yarn applications [1].
> [1] https://de.hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-fault-tolerance-features-long-running-services/

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)