Re: [DISCUSS] FLIP-135: Approximate Task-Local Recovery

Piotr Nowojski Mon, 19 Oct 2020 04:44:02 -0700

Hi,

I was involved in some off-line discussions and some early stages of the
discussions on this topic. In recent weeks Yuan has also prepared a good
looking prototype and draft changes that look good to me, so I would be +1
for accepting this FLIP.


I guess as there were no objections here, I would suggest starting a vote
thread for this proposal. If someone finds a problem with this FLIP, please
vote -1 there or let us know here.

Best,
Piotrek

niedz., 16 sie 2020 o 17:02 Yuan Mei <[email protected]> napisał(a):

> Hi Devs,
>
> I would like to start a formal discussion about "FLIP-135 Approximate
> Task-Local Recovery" [1] which supports approximate single task failure
> recovery without restarting the entire streaming job.
>
> Flink is no longer a pure streaming engine as it was born and has been
> extended to fit into many different scenarios over time: batch, AI,
> event-driven applications, e.t.c. Approximate task-local recovery is one of
> the attempts to fulfill these diversified scenarios and trade data
> consistency for fast failure recovery. More specifically, if a task fails,
> only the failed task restarts without affecting the rest of the job.
> Approximate task-local recovery is similar to
> RestartPipelinedRegionFailoverStrategy [2], with two major differences:
> - Instead of restarting a connected region, approximate task-local recovery
> restarts only the failed task(s) for a streaming job.
> - RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate
> task-local recovery expects data loss and a bit of data duplication when
> sources fail.
>
> Approximate task-local recovery is useful in scenarios where a certain
> amount of data loss is tolerable, but a full pipeline restart is not
> affordable. A typical use case is online training. Online training jobs are
> usually complicated with all-to-all task connections, and a single task
> failure with RestartPipelinedRegionFailoverStrategy may result in a
> complete restart of the whole pipeline. Besides, the initialization is
> time-consuming, including the procedure of loading training models and
> starting Python subprocesses, etc. The initialization may take minutes to
> complete on average. Hence, we introduce an approximate task-local recovery
> strategy to only restart failed tasks.
>
> To ease the discussion, we divide the problem of approximate task-local
> recovery into three steps with each step only focusing on addressing a set
> of problems, sketched as follows:
> 1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery.
>
> This FLIP focuses on tackling issues related to the first step "Sink
> Recovery", and we would like to collect broader feedback in this dedicated
> mail thread.
>
> Best,
>
> Yuan
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
>

Re: [DISCUSS] FLIP-135: Approximate Task-Local Recovery

Reply via email to