Hi Aaron,

From my understanding, you want to shut down a Task Manager without
restarting the job that has tasks running on it?

In the current implementation, if a Task Manager goes down, the
tasks on it are treated as failed. How a task failure is handled is
defined by the `FailoverStrategy`, which is `RestartAllStrategy` by default.
That is why the whole job restarts when a Task Manager is lost.
As Paul said, you could try the "region restart failover strategy" once 1.9 is
released. It might help, but that depends on your job topology.
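
To illustrate why the topology matters, here is a minimal Python sketch of
the idea behind region failover. It is a simplified model, not Flink's
implementation: it assumes a region is a set of tasks connected by pipelined
data exchanges, and that a failure restarts every task in the affected
region. The function and task names are made up for the example.

```python
from collections import defaultdict


def failover_regions(tasks, pipelined_edges):
    """Group tasks into regions: connected components over pipelined edges."""
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in pipelined_edges:
        parent[find(a)] = find(b)

    regions = defaultdict(set)
    for t in tasks:
        regions[find(t)].add(t)
    return list(regions.values())


def tasks_to_restart(tasks, pipelined_edges, failed):
    """Return every task that shares a region with a failed task."""
    failed = set(failed)
    restart = set()
    for region in failover_regions(tasks, pipelined_edges):
        if region & failed:
            restart |= region
    return restart


tasks = ["src-0", "map-0", "src-1", "map-1"]

# Forward (one-to-one) connections: two independent regions.
forward = [("src-0", "map-0"), ("src-1", "map-1")]
print(tasks_to_restart(tasks, forward, ["map-0"]))

# All-to-all connections (e.g. after a keyBy): one region spans the job.
shuffle = [(s, m) for s in ("src-0", "src-1") for m in ("map-0", "map-1")]
print(tasks_to_restart(tasks, shuffle, ["map-0"]))
```

With forward connections only the failed pipeline restarts. With an
all-to-all exchange every task lands in one region, so the result is the
same as a full restart.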

The deeper reason for this behavior is Flink's consistency semantics,
AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics, so there
is not much choice of `FailoverStrategy`.
This might be improved in the future. There are some discussions on the
mailing list about providing weaker consistency semantics to allow a more
flexible `FailoverStrategy`. We are pushing this improvement forward, and I
hope it can be included in 1.10.

Regarding your question, I guess the answer is no for now. More frequent
checkpoints, or a manually triggered savepoint, might help by making
recovery quicker.


Paul Lam <paullin3...@gmail.com> wrote on Fri, Jul 12, 2019 at 10:25 AM:

> Hi,
>
> Maybe the region restart strategy can help. It restarts only the minimum
> set of required tasks. Note that it's recommended only from the 1.9 release
> onward, see [1], unless you're running a stateless job.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10712
>
> Best,
> Paul Lam
>
> On Jul 12, 2019, at 03:38, Aaron Levin <aaronle...@stripe.com> wrote:
>
> Hello,
>
> Is there a way to gracefully terminate a Task Manager beyond just killing
> it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm
> interested in a way to replace a Task Manager that has currently-running
> tasks. It would be great if it was possible to terminate a Task Manager
> without restarting the job, though I'm not sure if this is possible.
>
> Context: at my work we regularly cycle our hosts for maintenance and
> security. Each time we do this we stop the task manager running on the host
> being cycled. This causes the entire job to restart, resulting in downtime
> for the job. I'd love to decrease this downtime if at all possible.
>
> Thanks! Any insight is appreciated!
>
> Best,
>
> Aaron Levin
>
>
>
