Hi Terry Wang,

To add to the context provided above: whenever a task manager goes down, the jobs go into a failed state and do not restart, even though there are more than enough free slots available on other task managers for them to restart on.

Regards,
Puneet

> On 04-Mar-2022, at 4:54 PM, Terry Wang <zjuwa...@gmail.com> wrote:
>
> Hi, Puneet~
>
> AFAIK, it is expected behavior that jobs on a crashed TaskManager restart.
> HA means there is no single point of failure, but a Flink job still needs to
> go through failover to ensure state and data consistency. You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
>
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
> Hi,
>
> Currently in production, I have an HA session-mode Flink cluster with 3 job
> managers and multiple task managers with more than enough free task slots.
> But I have seen multiple times that whenever a task manager goes down (e.g.
> due to a heartbeat issue), so do all the jobs running on it, even when there
> are standby task managers available with free slots to run them on. Has
> anyone faced this issue?
>
> Regards,
> Puneet
>
>
> --
> Best Regards,
> Terry Wang