Hi Robert,

Any solution or alternate approach to the above issue would be appreciated, as going live with new jobs will be unreliable with respect to a task manager going down.
On Fri, Sep 10, 2021 at 1:17 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
> Hi Robert,
>
> Thanks for taking the time to go through the logs.
>
> Problem:
> The reason for restarting all the task managers was to increase the JVM
> Metaspace size for each existing task manager. Currently each task manager
> has 32 slots, but the JVM Metaspace size was 256 MB, which used to get
> filled by deploying 4-5 jobs (irrespective of their parallelism). Since our
> use case is generic, the worst case is that there are 32 jobs running on a
> single task manager.
>
> Solution:
> The basic solution was to increase the JVM Metaspace size to 3 GB to
> accommodate 32 jobs. This required a restart of all the task manager JVMs
> with the given changes. We had a total of 10 task managers, of which 7 were
> completely empty. In slot terms there was a total of 320 slots, of which
> around 240 were available at restart time. First we targeted all the task
> managers that were completely empty. Once those restarted, we targeted the
> task managers where jobs were up and running.
>
> Issue Faced:
> On the first task manager we targeted, I faced the above-mentioned issue:
> instead of going into the restart phase and getting spawned on other task
> managers, the jobs failed. But these failed jobs were not even listed in
> the completed jobs section of the Flink UI. That is why I used the term
> "disappeared". From prior experience, any job with a terminal status
> usually gets listed under completed jobs.
>
> Thanks
>
> On 10-Sep-2021, at 11:34 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Thanks for the log.
>
> From the partial log that you shared with me, my assumption is that some
> external resource manager is shutting down your cluster. Multiple
> TaskManagers are disconnecting, and finally the job is switching into
> failed state.
> It seems that you are not stopping only one TaskManager, but all of them.
>
> Why are you restarting a TaskManager?
> How are you deploying Flink?
> On Fri, Sep 10, 2021 at 12:46 AM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
>> Hi,
>>
>> Please find attached the log file regarding the job not getting restarted
>> on another task manager once the existing task manager was restarted.
>>
>> Just FYI - We are using Fixed Delay Restart (5 times, 10s delay)
>>
>> On Thu, Sep 9, 2021 at 4:29 PM Robert Metzger <rmetz...@apache.org> wrote:
>>> Hi Puneet,
>>>
>>> Can you provide us with the JobManager logs of this incident? Jobs
>>> should not disappear; they should restart on other Task Managers.
>>>
>>> On Wed, Sep 8, 2021 at 3:06 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> For the past 2-3 days I have been looking for documentation that
>>>> elaborates on how Flink takes care of restarting a data streaming job.
>>>> I know all the restart and failover strategies, but wanted to know how
>>>> the different components (Job Manager, Task Manager, etc.) play a role
>>>> while restarting a Flink data streaming job.
>>>>
>>>> I am asking this because recently in production, when I restarted a
>>>> task manager, all the jobs running on it disappeared instead of getting
>>>> restarted. Within the Flink UI, I couldn't track those jobs in the
>>>> completed jobs section either. The logs also couldn't provide me with
>>>> good enough information.
>>>>
>>>> Also, can anyone tell me what the role of the /tmp/executionGraphStore
>>>> folder on the Job Manager machine is?
>>>>
>>>> Thanks
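(For reference, the "Fixed Delay Restart (5 times, 10s delay)" setup mentioned in the thread corresponds to a flink-conf.yaml configuration along these lines; it can also be set per job via the ExecutionConfig. Sketch only, using the cluster-wide keys.)

```
# flink-conf.yaml
# Restart a failed job up to 5 times, waiting 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 10 s
```

A caveat relevant to this thread: a restart attempt only succeeds if the JobManager can still acquire free slots. If the remaining attempts are exhausted while no TaskManager slots are available (e.g. during a rolling restart of all TaskManagers), the job transitions to a terminally failed state instead of restarting.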