Hi Robert,

Thanks for taking the time to go through the logs.

Problem:
The reason for restarting all the task managers was to apply an increased JVM 
metaspace size to each existing task manager. Currently each task manager has 
32 slots, but the JVM metaspace size was 256 MB, which used to fill up after 
deploying 4-5 jobs (irrespective of their parallelism). Since our use case is 
generic, the worst case is 32 jobs running on a single task manager.
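For reference, the relevant keys in our flink-conf.yaml looked roughly like this 
(a sketch of just the two settings involved, assuming the Flink 1.10+ memory 
model key names; the rest of the config is omitted):

    taskmanager.numberOfTaskSlots: 32
    # metaspace for the whole TaskManager JVM, shared by every job running on its slots
    taskmanager.memory.jvm-metaspace.size: 256m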

Solution:
The basic solution was to increase the JVM metaspace size to 3 GB to 
accommodate 32 jobs. This required restarting every task manager JVM with the 
new setting. We had a total of 10 task managers, of which 7 were completely 
empty. In slot terms there were a total of 320 slots, of which around 240 were 
available at restart time. First we targeted all the task managers that were 
completely empty. Once those had restarted, we targeted the task managers where 
jobs were up and running.
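The change itself was just bumping the metaspace key (again a sketch, assuming 
the same flink-conf.yaml is the single source of truth for all task managers):

    # previously 256m; sized for a worst case of ~32 jobs per task manager
    taskmanager.memory.jvm-metaspace.size: 3g

followed by the rolling restart of the task manager JVMs described above.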

Issue Faced:
On the first task manager we targeted, I faced the above-mentioned issue: 
instead of going into the restart phase and getting spawned on other task 
managers, the jobs failed. But these failed jobs were not even listed in the 
completed jobs section of the Flink UI, which is why I used the term 
"disappeared". In my prior experience, any job that reaches a terminal status 
gets listed under completed jobs.
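For what it's worth, my understanding is that this completed jobs list is backed 
by the job manager's execution graph store (the /tmp/executionGraphStore folder 
I asked about below), whose retention is controlled by the jobstore.* options. A 
sketch with the key names as I read them in the Flink configuration docs, so 
please correct me if they differ in your version:

    # how long a terminal (finished/failed/cancelled) job stays listed, in seconds
    jobstore.expiration-time: 3600
    # upper bound on the number of terminal jobs kept in the store
    jobstore.max-capacity: 1000

So even a failed job should normally remain visible there for a while, which is 
why the disappearance surprised me.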

Thanks

> On 10-Sep-2021, at 11:34 AM, Robert Metzger <rmetz...@apache.org> wrote:
> 
> Thanks for the log.
> 
> From the partial log that you shared with me, my assumption is that some 
> external resource manager is shutting down your cluster. Multiple 
> TaskManagers are disconnecting, and finally the job is switching into failed 
> state.
> It seems that you are not stopping only one TaskManager, but all of them.
> 
> Why are you restarting a TaskManager?
> How are you deploying Flink?
> 
> On Fri, Sep 10, 2021 at 12:46 AM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
> Hi,
> 
> Please find attached the logfile regarding the job not getting restarted on another 
> task manager once the existing task manager was restarted.
> 
> Just FYI - We are using Fixed Delay Restart (5 times, 10s delay)
> 
> On Thu, Sep 9, 2021 at 4:29 PM Robert Metzger <rmetz...@apache.org> wrote:
> Hi Puneet,
> 
> Can you provide us with the JobManager logs of this incident? Jobs should not 
> disappear, they should restart on other Task Managers.
> 
> On Wed, Sep 8, 2021 at 3:06 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
> Hi,
> 
> So for the past 2-3 days I have been looking for documentation that elaborates on 
> how Flink takes care of restarting a data streaming job. I know all the 
> restart and failover strategies, but I wanted to know how the different components 
> (Job Manager, Task Manager, etc.) play a role while restarting a Flink data 
> streaming job. 
> 
> I am asking this because recently in production, when I restarted a task 
> manager, all the jobs running on it disappeared instead of getting restarted. 
> Within the Flink UI, I couldn't track those jobs in completed jobs either. The 
> logs also couldn't provide me with enough information.
> 
> Also, could anyone tell me what the role of the /tmp/executionGraphStore 
> folder on the Job Manager machine is?
> 
> Thanks
> 
> 
