Hi,

I'm trying to enable HA for my Flink jobs running on AWS EMR. Following [1], I created a common Flink YARN session and am submitting all my jobs to it.

These 4 config params were added:

  high-availability = zookeeper
  high-availability.storageDir =
  high-availability.zookeeper.path.root = /flink
  high-availability.zookeeper.quorum = <EMR's master node's DNS name>:2181

(The ZooKeeper that came with EMR was used.)

The command to start that Flink YARN session is:

`flink-yarn-session -Dtaskmanager.memory.process.size=4g -nm FlinkCommonSession -z FlinkCommonSession -d`

The first HA test (YARN application killed) went well. I killed the common session using `yarn application --kill <appId>` and created a new session with the same command; the jobs were restored automatically once that session was up.

However, the second HA test (EMR cluster crashed) didn't work: the *jobs are not restored* after the common session is created on the new EMR cluster. (Attached: jobmanager.gz <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/jobmanager.gz>)

Should I expect the jobs to be restored in scenario 2 (EMR cluster crashed)? Am I missing something here?

Thanks for your help.

Regards,
Averell

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/yarn_setup.html