Hi Robert,

Any solution or alternate approach to the above issue would be appreciated, as going live with new jobs will be unreliable with respect to a task manager going down.
On Fri, Sep 10, 2021 at 1:17 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
> Hi Robert,
>
> Thanks for taking the time to go through the logs.
>
> Problem:
> The reason for restarting all the task managers was to increase the JVM
> Metaspace size for each existing task manager. Currently each task manager
> has 32 slots, but the JVM Metaspace size was 256 MB, which used to get
> filled by deploying 4-5 jobs (irrespective of their parallelism). Since our
> use case is generic, the worst case is that there are 32 jobs running on a
> single task manager.
>
> Solution:
> The basic solution was to increase the JVM Metaspace size to 3 GB to
> accommodate 32 jobs. This required a restart of all the task manager JVMs
> with the given changes. We had a total of 10 task managers, of which 7 were
> completely empty. In slot terms there was a total of 320 slots, of which
> around 240 were available at restart time. First we targeted all the task
> managers that were completely empty. Once those restarted, we targeted the
> task managers where jobs were up and running.
>
> Issue Faced:
> On the first task manager we targeted, I faced the above-mentioned issue:
> instead of going into the restart phase and getting spawned on other task
> managers, the jobs failed. But these failed jobs were not even listed in
> the completed jobs section of the Flink UI. That is why I used the term
> "disappeared". From prior experience, any job with a terminal status
> usually gets listed under completed jobs.
>
> Thanks
>
> On 10-Sep-2021, at 11:34 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Thanks for the log.
>
> From the partial log that you shared with me, my assumption is that some
> external resource manager is shutting down your cluster. Multiple
> TaskManagers are disconnecting, and finally the job is switching into
> failed state.
> It seems that you are not stopping only one TaskManager, but all of them.
>
> Why are you restarting a TaskManager?
> How are you deploying Flink?
> On Fri, Sep 10, 2021 at 12:46 AM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
>> Hi,
>>
>> Please find attached the log file regarding the job not getting restarted
>> on another task manager once the existing task manager was restarted.
>>
>> Just FYI - We are using Fixed Delay Restart (5 times, 10s delay)
>>
>> On Thu, Sep 9, 2021 at 4:29 PM Robert Metzger <rmetz...@apache.org> wrote:
>>> Hi Puneet,
>>>
>>> Can you provide us with the JobManager logs of this incident? Jobs
>>> should not disappear; they should restart on other Task Managers.
>>>
>>> On Wed, Sep 8, 2021 at 3:06 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> For the past 2-3 days I have been looking for documentation that
>>>> elaborates on how Flink takes care of restarting a data streaming job.
>>>> I know all the restart and failover strategies, but wanted to know how
>>>> the different components (Job Manager, Task Manager, etc.) play a role
>>>> while restarting a Flink data streaming job.
>>>>
>>>> I am asking this because recently in production, when I restarted a
>>>> task manager, all the jobs running on it disappeared instead of getting
>>>> restarted. Within the Flink UI, I couldn't track those jobs in the
>>>> completed jobs section either. The logs also couldn't provide me with
>>>> good enough information.
>>>>
>>>> Also, can anyone tell me what the role of the /tmp/executionGraphStore
>>>> folder on the Job Manager machine is?
>>>>
>>>> Thanks
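(For reference, the "Fixed Delay Restart (5 times, 10s delay)" setup mentioned in the thread corresponds to a flink-conf.yaml configuration along these lines; it can also be set per job via the ExecutionConfig. Sketch only, using the cluster-wide keys.)

```
# flink-conf.yaml
# Restart a failed job up to 5 times, waiting 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 10 s
```

A caveat relevant to this thread: a restart attempt only succeeds if the JobManager can still acquire free slots. If the remaining attempts are exhausted while no TaskManager slots are available (e.g. during a rolling restart of all TaskManagers), the job transitions to a terminally failed state instead of restarting.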