[DISCUSS] FLIP-135: Approximate Task-Local Recovery

Yuan Mei Sun, 16 Aug 2020 07:49:25 -0700

Hi Devs,

I would like to start a formal discussion about "FLIP-135 Approximate
Task-Local Recovery" [1] which supports approximate single task failure
recovery without restarting the entire streaming job.

Flink is no longer the pure streaming engine it started as; over time it has
been extended to fit many different scenarios: batch, AI, event-driven
applications, etc. Approximate task-local recovery is one attempt to serve
these diverse scenarios by trading data consistency for fast failure
recovery. More specifically, if a task fails, only the failed task restarts,
without affecting the rest of the job.
Approximate task-local recovery is similar to
RestartPipelinedRegionFailoverStrategy [2], with two major differences:
- Instead of restarting a connected region, approximate task-local recovery
restarts only the failed task(s) for a streaming job.
- RestartPipelinedRegionFailoverStrategy preserves exactly-once semantics,
while approximate task-local recovery expects data loss and a bit of data
duplication when sources fail.

Approximate task-local recovery is useful in scenarios where a certain
amount of data loss is tolerable, but a full pipeline restart is not
affordable. A typical use case is online training. Online training jobs are
usually complex, with all-to-all task connections, so a single task failure
under RestartPipelinedRegionFailoverStrategy may result in a complete
restart of the whole pipeline. In addition, initialization is
time-consuming: it includes loading training models and starting Python
subprocesses, and may take minutes to complete on average. Hence, we
introduce an approximate task-local recovery strategy that restarts only the
failed tasks.
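
As a rough illustration of the difference in restart scope, here is a
simplified sketch (not Flink's actual failover code). It assumes a
hypothetical two-stage job with parallelism 3 and all-to-all pipelined
connections; the task names and both helper functions are made up for this
example. The region is modeled simply as the set of tasks connected to the
failed task through pipelined edges.

```python
from collections import defaultdict


def pipelined_region(edges, failed):
    """Simplified model: every task reachable from `failed` via pipelined
    edges, followed in both directions (a connected component)."""
    neighbours = defaultdict(set)
    for upstream, downstream in edges:
        neighbours[upstream].add(downstream)
        neighbours[downstream].add(upstream)

    region, stack = {failed}, [failed]
    while stack:
        task = stack.pop()
        for other in neighbours[task]:
            if other not in region:
                region.add(other)
                stack.append(other)
    return region


def task_local(edges, failed):
    """Approximate task-local recovery restarts only the failed task."""
    return {failed}


# Hypothetical job: 3 sources feeding 3 trainers, all-to-all.
sources = [f"source-{i}" for i in range(3)]
trainers = [f"trainer-{i}" for i in range(3)]
edges = [(s, t) for s in sources for t in trainers]

region_restart = pipelined_region(edges, "trainer-1")
local_restart = task_local(edges, "trainer-1")

print(len(region_restart), "tasks restarted with region failover")
print(len(local_restart), "task restarted with task-local recovery")
```

With all-to-all connections the region covers all six tasks, whereas
task-local recovery touches only trainer-1.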

To ease the discussion, we divide the problem of approximate task-local
recovery into three steps, each addressing a specific set of problems:
1) Sink Recovery, 2) Downstream Recovery, and 3) Single Task Recovery.

This FLIP focuses on tackling issues related to the first step "Sink
Recovery", and we would like to collect broader feedback in this dedicated
mail thread.

Best,

Yuan

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
