Hi Haibo, thanks for tip, I almost forgot about max-attempts. I understood implication of running with one AM.
Maybe my question was incorrect, but what would be faster (with regards to downtime of each job): 1. In case of yarn-session: Parallel cancel all jobs with savepoints, restart yarn-session, parallel start all jobs from savepoints 2. In case of per-job mode Parallel cancel all jobs with savepoints, parallel start all jobs from savepoints. I want to optimise standard situation where I deploy new version of all my jobs. My current impression that job starts faster in yarn-session mode. Thanks, Maxim. On Thu, Jul 18, 2019 at 4:57 AM Haibo Sun <sunhaib...@163.com> wrote: > Hi, Maxim > > For the concern talking on the first point: > If HA and checkpointing are enabled, AM (the application master, that is > the job manager you said) will be restarted by YARN after it dies, and then > the dispatcher will try to restore all the previously running jobs > correctly. Note that the number of attempts be decided by the > configurations "yarn.resourcemanager.am.max-attempts" and > "yarn.application-attempts". The obvious difference between the session and > per-job modes is that if a fatal error occurs on AM, it will affect all > jobs running in it, while the per-job mode will only affect one job. > > You can look at this document to see how to configure HA for the Flink > cluster on YARN: > https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/jobmanager_high_availability.html#yarn-cluster-high-availability > . > > Best, > Haibo > > At 2019-07-17 23:53:15, "Maxim Parkachov" <lazy.gop...@gmail.com> wrote: > > Hi, > > I'm looking for advice on how to run flink streaming jobs on Yarn cluster > in production environment. I tried in testing environment both approaches > with HA mode, namely yarn session + multiple jobs vs cluster per job, both > seems to work for my cases, with slight preference of yarn session mode to > centrally manage credentials. I'm looking to run about 10 streaming jobs > mostly reading/writing from kafka + cassandra with following restictions: > 1. yarn nodes will be hard rebooted quite often, roughly every 2 weeks. I > have a concern here what happens when Job manager dies in session mode. > 2. there are often network interruptions/slowdowns. > 3. I'm trying to minimise time to restart job to have as much as possible > continious processing. > > Thanks in advance, > Maxim. > >