Re: Graceful Task Manager Termination and Replacement

Biao Liu Mon, 29 Jul 2019 19:52:36 -0700

Hi Yu,

That's a great proposal. I hope to see this feature soon!


On Mon, Jul 29, 2019 at 4:59 PM Yu Li <car...@gmail.com> wrote:

> Belated, but FWIW: besides the region failover and best-effort failover
> work, I believe stop-with-checkpoint, as proposed in FLINK-12619 [1] and
> FLIP-45 [2], could also help here.
>
> W.r.t. k8s, there has also been some offline discussion about supporting
> local recovery with persistent volumes, even when tasks are assigned to
> other TMs during job failover.
>
> [1] https://issues.apache.org/jira/browse/FLINK-12619
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
>
> Best Regards,
> Yu
>
>
> On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aaronle...@stripe.com> wrote:
>
>> I was on vacation but wanted to thank Biao for summarizing the current
>> state! Thanks!
>>
>> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mmyy1...@gmail.com> wrote:
>>
>>> Hi Aaron,
>>>
>>> From my understanding, you want to shut down a Task Manager without
>>> restarting the job that has tasks running on it?
>>>
>>> In the current implementation, if a Task Manager goes down, the tasks
>>> on it are treated as failed. How a task failure is handled is defined
>>> by the `FailoverStrategy`, which is `RestartAllStrategy` by default.
>>> That is why the whole job restarts when a Task Manager is lost. As Paul
>>> said, you could try the "region restart failover strategy" once 1.9 is
>>> released. It might help, but that depends on your job topology.
>>>
>>> The deeper reason for this behavior is Flink's consistency semantics,
>>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics, so
>>> there is not much choice of `FailoverStrategy`.
>>> This might improve in the future. There are some discussions on the
>>> mailing list about providing weaker consistency semantics to allow more
>>> flexible `FailoverStrategy` options. We are pushing this improvement
>>> forward, and I hope it can be included in 1.10.
>>>
>>> Regarding your question, I guess the answer is no for now. More
>>> frequent checkpoints, or a manually triggered savepoint, might help by
>>> making recovery quicker.
>>>
>>>
>>> On Fri, Jul 12, 2019 at 10:25 AM Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Maybe the region restart strategy can help. It restarts only the minimum
>>>> set of required tasks. Note that it’s recommended only from the 1.9
>>>> release onward, see [1], unless you’re running a stateless job.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>>
>>>> Best,
>>>> Paul Lam
>>>>
>>>> On Jul 12, 2019, at 03:38, Aaron Levin <aaronle...@stripe.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Is there a way to gracefully terminate a Task Manager beyond just
>>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>>> Specifically I'm interested in a way to replace a Task Manager that has
>>>> currently-running tasks. It would be great if it was possible to terminate
>>>> a Task Manager without restarting the job, though I'm not sure if this is
>>>> possible.
>>>>
>>>> Context: at my work we regularly cycle our hosts for maintenance and
>>>> security. Each time we do this we stop the task manager running on the host
>>>> being cycled. This causes the entire job to restart, resulting in downtime
>>>> for the job. I'd love to decrease this downtime if at all possible.
>>>>
>>>> Thanks! Any insight is appreciated!
>>>>
>>>> Best,
>>>>
>>>> Aaron Levin
>>>>
>>>>
>>>>
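
The point made in the thread, that the region restart strategy restarts only the required tasks and that its benefit depends on job topology, can be illustrated with a toy model. This is a minimal sketch and not Flink's implementation. The function names `pipelined_regions` and `tasks_to_restart`, the task names, and the simplification that a region is just a connected component over pipelined edges are all assumptions made for illustration.

```python
from collections import defaultdict


def pipelined_regions(tasks, pipelined_edges):
    """Group tasks into regions (hypothetical model).

    A region here is a connected component of the graph whose edges are
    pipelined data exchanges between tasks.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in pipelined_edges:
        parent[find(a)] = find(b)

    regions = defaultdict(set)
    for t in tasks:
        regions[find(t)].add(t)
    return list(regions.values())


def tasks_to_restart(tasks, pipelined_edges, failed_tasks):
    """Return every task that shares a region with a failed task."""
    failed = set(failed_tasks)
    restart = set()
    for region in pipelined_regions(tasks, pipelined_edges):
        if region & failed:
            restart |= region
    return restart


tasks = ["src-0", "map-0", "src-1", "map-1"]

# Forward-only topology: two independent pipelines.
forward = [("src-0", "map-0"), ("src-1", "map-1")]

# All-to-all shuffle (e.g. after a keyBy): every source feeds every map.
shuffle = [(s, m) for s in ("src-0", "src-1") for m in ("map-0", "map-1")]

# Suppose the lost Task Manager was running only "map-1".
print(sorted(tasks_to_restart(tasks, forward, {"map-1"})))
print(sorted(tasks_to_restart(tasks, shuffle, {"map-1"})))
```

In this model, the forward-only job restarts just the pipeline containing the failed task, while the all-to-all job forms a single region and restarts in full, which matches the whole-job restart described above.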
