Hi,

Can you explain what "EMR cluster crashed" means in the 2nd scenario?
Can you also share:
- yarn.application-attempts in Flink
- yarn.resourcemanager.am.max-attempts in YARN
- number of EMR master nodes (1 or 3)
- EMR version?

Regards,
Roman


On Mon, Oct 19, 2020 at 8:22 AM Averell <lvhu...@gmail.com> wrote:

> Hi,
>
> I'm trying to enable HA for my Flink jobs running on AWS EMR.
> Following [1], I created a common Flink YARN session and am submitting all my
> jobs to it. These 4 config params were added:
>     high-availability = zookeeper
>     high-availability.storageDir =
>     high-availability.zookepper.path.root = /flink
>     high-availability.zookeeper.quorum = <EMR's master node's DNS name>:2181
> (The ZooKeeper that came with EMR was used.)
>
> The command to start that Flink YARN session is like this:
> `flink-yarn-session -Dtaskmanager.memory.process.size=4g -nm FlinkCommonSession -z FlinkCommonSession -d`
>
> The first HA test - YARN application killed - went well. I killed that
> common session by using `yarn application --kill <appId>` and created a
> new session using the same command, then the jobs were restored
> automatically after that session was up.
>
> However, the 2nd HA test - EMR cluster crashed - didn't work: the jobs were
> not restored after the common session was created on the new EMR cluster.
> (attached jobmanager.gz
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/jobmanager.gz>)
>
> Should I expect the jobs to be restored in scenario no. 2 (EMR cluster
> crashed)?
> Am I missing something here?
>
> Thanks for your help.
>
> Regards,
> Averell
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/yarn_setup.html
