Hi Yu,

That's a great proposal. I hope to see this feature soon!

On Mon, Jul 29, 2019 at 4:59 PM Yu Li <car...@gmail.com> wrote:

> Belated but FWIW, besides the region failover and best-effort failover
> efforts, I believe stop-with-checkpoint as proposed in FLINK-12619 [1] and
> FLIP-45 [2] could also help here.
>
> W.r.t. k8s, there has also been some offline discussion about supporting
> local recovery with persistent volumes even when tasks are assigned to
> other TMs during job failover.
>
> [1] https://issues.apache.org/jira/browse/FLINK-12619
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
>
> Best Regards,
> Yu
>
> On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aaronle...@stripe.com> wrote:
>
>> I was on vacation but wanted to thank Biao for summarizing the current
>> state! Thanks!
>>
>> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mmyy1...@gmail.com> wrote:
>>
>>> Hi Aaron,
>>>
>>> From my understanding, you want to shut down a Task Manager without
>>> restarting the job that has tasks running on it?
>>>
>>> In the current implementation, if a Task Manager goes down, the tasks
>>> on it are treated as failed. The behavior on task failure is defined by
>>> the `FailoverStrategy`, which is `RestartAllStrategy` by default. That's
>>> why the whole job restarts when a Task Manager is gone. As Paul said,
>>> you could try the "region restart failover strategy" when 1.9 is
>>> released. It might help, though that depends on your job topology.
>>>
>>> The deeper reason for this issue is Flink's consistency semantics,
>>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics, so
>>> there is not much choice of `FailoverStrategy`.
>>> This might be improved in the future. There are some discussions on the
>>> mailing list about providing weaker consistency semantics to improve
>>> the `FailoverStrategy`. We are pushing this improvement forward, and I
>>> hope it can be included in 1.10.
>>>
>>> Regarding your question, I guess the answer is no for now. More
>>> frequent checkpoints or a manually triggered savepoint might help by
>>> enabling a quicker recovery.
>>>
>>> On Fri, Jul 12, 2019 at 10:25 AM Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Maybe the region restart strategy can help. It restarts only the
>>>> minimum set of required tasks. Note that it's recommended only from
>>>> the 1.9 release onward, see [1], unless you're running a stateless job.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>>
>>>> Best,
>>>> Paul Lam
>>>>
>>>> On Jul 12, 2019, at 03:38, Aaron Levin <aaronle...@stripe.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Is there a way to gracefully terminate a Task Manager beyond just
>>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>>> Specifically, I'm interested in a way to replace a Task Manager that
>>>> has currently running tasks. It would be great if it were possible to
>>>> terminate a Task Manager without restarting the job, though I'm not
>>>> sure if this is possible.
>>>>
>>>> Context: at my work we regularly cycle our hosts for maintenance and
>>>> security. Each time we do this we stop the task manager running on the
>>>> host being cycled. This causes the entire job to restart, resulting in
>>>> downtime for the job. I'd love to decrease this downtime if at all
>>>> possible.
>>>>
>>>> Thanks! Any insight is appreciated!
>>>>
>>>> Best,
>>>>
>>>> Aaron Levin