Hi,

Can you explain what "EMR cluster crashed" means in the 2nd scenario?

Can you also share:
- yarn.application-attempts in Flink
- yarn.resourcemanager.am.max-attempts in Yarn
- number of EMR master nodes (1 or 3)
- EMR version?

Regards,
Roman

On Mon, Oct 19, 2020 at 8:22 AM Averell <lvhu...@gmail.com> wrote:

> Hi,
>
> I'm trying to enable HA for my Flink jobs running on AWS EMR.
> Following [1], I created a common Flink YARN session and am submitting
> all my jobs to that one. These 4 config params were added:
>
>     high-availability = zookeeper
>     high-availability.storageDir =
>     high-availability.zookepper.path.root = /flink
>     high-availability.zookeeper.quorum = <EMR's master node's DNS name>:2181
>
> (The ZooKeeper that came with EMR was used.)
>
> The command to start that Flink YARN session is like this:
> `flink-yarn-session -Dtaskmanager.memory.process.size=4g -nm
> FlinkCommonSession -z FlinkCommonSession -d`
>
> The first HA test - yarn application killed - went well. I killed that
> common session using `yarn application --kill <appId>` and created a
> new session using the same command; the jobs were then restored
> automatically once that session was up.
>
> However, the 2nd HA test - EMR cluster crashed - didn't work: the jobs
> are not restored after the common session is created on the new EMR
> cluster.
> (attached jobmanager.gz
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/jobmanager.gz>)
>
> Should I expect the jobs to be restored in scenario no. 2 - EMR
> cluster crashed?
> Am I missing something here?
>
> Thanks for your help.
>
> Regards,
> Averell
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/yarn_setup.html