Belated but FWIW, besides the region failover and best-efforts failover
efforts, I believe stop with checkpoint as proposed in FLINK-12619 and
FLIP-45 could also help here, FYI.

W.r.t k8s, there're also some offline discussion about supporting local
recovery with persistent volume even when task assigned to other TMs during
job failover.

[1] https://issues.apache.org/jira/browse/FLINK-12619
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

Best Regards,
Yu


On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aaronle...@stripe.com> wrote:

> I was on vacation but wanted to thank Biao for summarizing the current
> state! Thanks!
>
> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mmyy1...@gmail.com> wrote:
>
>> Hi Aaron,
>>
>> From my understanding, you want shutting down a Task Manager without
>> restart the job which has tasks running on this Task Manager?
>>
>> Based on current implementation, if there is a Task Manager is down, the
>> tasks on it would be treated as failed. The behavior of task failure is
>> defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
>> That's the reason why the whole job restarts when a Task Manager has
>> gone. As Paul said, you could try "region restart failover strategy" when
>> 1.9 is released. It might be helpful however it depends on your job
>> topology.
>>
>> The deeper reason of this issue is the consistency semantics of Flink,
>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there
>> is no much choice of `FailoverStrategy`.
>> It might be improved in the future. There are some discussions in the
>> mailing list that providing some weaker consistency semantics to improve
>> the `FailoverStrategy`. We are pushing forward this improvement. I hope it
>> can be included in 1.10.
>>
>> Regarding your question, I guess the answer is no for now. A more
>> frequent checkpoint or a savepoint manually triggered might be helpful by a
>> quicker recovery.
>>
>>
>> Paul Lam <paullin3...@gmail.com> 于2019年7月12日周五 上午10:25写道:
>>
>>> Hi,
>>>
>>> Maybe region restart strategy can help. It restarts minimum required
>>> tasks. Note that it’s recommended to use only after 1.9 release, see [1],
>>> unless you’re running a stateless job.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>
>>> Best,
>>> Paul Lam
>>>
>>> 在 2019年7月12日,03:38,Aaron Levin <aaronle...@stripe.com> 写道:
>>>
>>> Hello,
>>>
>>> Is there a way to gracefully terminate a Task Manager beyond just
>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>> Specifically I'm interested in a way to replace a Task Manager that has
>>> currently-running tasks. It would be great if it was possible to terminate
>>> a Task Manager without restarting the job, though I'm not sure if this is
>>> possible.
>>>
>>> Context: at my work we regularly cycle our hosts for maintenance and
>>> security. Each time we do this we stop the task manager running on the host
>>> being cycled. This causes the entire job to restart, resulting in downtime
>>> for the job. I'd love to decrease this downtime if at all possible.
>>>
>>> Thanks! Any insight is appreciated!
>>>
>>> Best,
>>>
>>> Aaron Levin
>>>
>>>
>>>

Reply via email to