Thanks for the log. From the partial log that you shared with me, my assumption is that some external resource manager is shutting down your cluster. Multiple TaskManagers are disconnecting, and finally the job switches into the failed state. It seems that you are stopping not just one TaskManager, but all of them.
Why are you restarting a TaskManager? How are you deploying Flink?

On Fri, Sep 10, 2021 at 12:46 AM Puneet Duggal <puneetduggal1...@gmail.com> wrote:

> Hi,
>
> Please find attached the log file regarding the job not getting restarted on
> another task manager once the existing task manager got restarted.
>
> Just FYI - We are using Fixed Delay Restart (5 times, 10s delay)
>
> On Thu, Sep 9, 2021 at 4:29 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi Puneet,
>>
>> Can you provide us with the JobManager logs of this incident? Jobs should
>> not disappear; they should restart on other Task Managers.
>>
>> On Wed, Sep 8, 2021 at 3:06 PM Puneet Duggal <puneetduggal1...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> For the past 2-3 days I have been looking for documentation which
>>> elaborates how Flink takes care of restarting a data streaming job. I
>>> know all the restart and failover strategies, but wanted to know how the
>>> different components (Job Manager, Task Manager etc.) play a role while
>>> restarting a Flink data streaming job.
>>>
>>> I am asking this because recently in production, when I restarted a
>>> task manager, all the jobs running on it disappeared instead of getting
>>> restarted. Within the Flink UI, I couldn't track those jobs in completed
>>> jobs either. The logs also couldn't provide me with good enough information.
>>>
>>> Also, can anyone tell me what the role of the /tmp/executionGraphStore
>>> folder on the Job Manager machine is?
>>>
>>> Thanks
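
For reference, the Fixed Delay Restart setup mentioned in the thread (5 attempts, 10s delay) would typically look roughly like the following in flink-conf.yaml. This is only a sketch, assuming the Flink 1.x configuration keys that were current at the time of this thread; it is not the poster's actual configuration, which was not shared.

```yaml
# Assumed flink-conf.yaml entries for a fixed-delay restart strategy.
# Values taken from the thread: 5 restart attempts, 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 10 s
```

With these settings a job that keeps failing would be given up on after the fifth attempt, roughly 50 seconds of waiting in total, so TaskManagers that stay away longer than that could leave the job in the failed state described above.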