Belated but FWIW, besides the region failover and best-efforts failover efforts, I believe stop with checkpoint as proposed in FLINK-12619 and FLIP-45 could also help here, FYI.
W.r.t k8s, there're also some offline discussion about supporting local recovery with persistent volume even when task assigned to other TMs during job failover. [1] https://issues.apache.org/jira/browse/FLINK-12619 [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic Best Regards, Yu On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aaronle...@stripe.com> wrote: > I was on vacation but wanted to thank Biao for summarizing the current > state! Thanks! > > On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mmyy1...@gmail.com> wrote: > >> Hi Aaron, >> >> From my understanding, you want shutting down a Task Manager without >> restart the job which has tasks running on this Task Manager? >> >> Based on current implementation, if there is a Task Manager is down, the >> tasks on it would be treated as failed. The behavior of task failure is >> defined via `FailoverStrategy` which is `RestartAllStrategy` by default. >> That's the reason why the whole job restarts when a Task Manager has >> gone. As Paul said, you could try "region restart failover strategy" when >> 1.9 is released. It might be helpful however it depends on your job >> topology. >> >> The deeper reason of this issue is the consistency semantics of Flink, >> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there >> is no much choice of `FailoverStrategy`. >> It might be improved in the future. There are some discussions in the >> mailing list that providing some weaker consistency semantics to improve >> the `FailoverStrategy`. We are pushing forward this improvement. I hope it >> can be included in 1.10. >> >> Regarding your question, I guess the answer is no for now. A more >> frequent checkpoint or a savepoint manually triggered might be helpful by a >> quicker recovery. >> >> >> Paul Lam <paullin3...@gmail.com> 于2019年7月12日周五 上午10:25写道: >> >>> Hi, >>> >>> Maybe region restart strategy can help. It restarts minimum required >>> tasks. Note that it’s recommended to use only after 1.9 release, see [1], >>> unless you’re running a stateless job. >>> >>> [1] https://issues.apache.org/jira/browse/FLINK-10712 >>> >>> Best, >>> Paul Lam >>> >>> 在 2019年7月12日,03:38,Aaron Levin <aaronle...@stripe.com> 写道: >>> >>> Hello, >>> >>> Is there a way to gracefully terminate a Task Manager beyond just >>> killing it (this seems to be what `./taskmanager.sh stop` does)? >>> Specifically I'm interested in a way to replace a Task Manager that has >>> currently-running tasks. It would be great if it was possible to terminate >>> a Task Manager without restarting the job, though I'm not sure if this is >>> possible. >>> >>> Context: at my work we regularly cycle our hosts for maintenance and >>> security. Each time we do this we stop the task manager running on the host >>> being cycled. This causes the entire job to restart, resulting in downtime >>> for the job. I'd love to decrease this downtime if at all possible. >>> >>> Thanks! Any insight is appreciated! >>> >>> Best, >>> >>> Aaron Levin >>> >>> >>>