With streaming execution, the entire pipeline must be running at all times so that results can be produced continuously. With batch execution, by contrast, the job graph can be segmented into separate pipelined stages that execute sequentially, each running to completion before the next begins. See [1] for more details on how the scheduler is organized.
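For reference, here is a minimal sketch of selecting BATCH execution mode on the DataStream API (the class name, pipeline, and job name are just illustrative; the mode can also be set at submission time with -Dexecution.runtime-mode=BATCH):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // With bounded sources, BATCH mode enables the staged scheduling
        // described above: stages run one after another, and each stage's
        // results are materialized for the next stage to read.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("stage", "by", "stage", "execution")
           .map(String::toUpperCase)
           .print();

        env.execute("batch-mode-sketch");
    }
}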
Each stage writes its results to disk, to be read as input by the subsequent stage. During recovery, the job could be restarted from the very beginning, but if the job has completed some intermediate stages, and the results of those stages are still available on disk, then those successfully completed stages do not need to be re-executed.

David

[1] https://flink.apache.org/2020/12/02/improvements-in-task-scheduling-for-batch-workloads-in-apache-flink-1.12/

On Fri, Feb 16, 2024 at 4:29 AM Вова Фролов <vovafrolov1...@gmail.com> wrote:

> Hi everyone,
>
> I am currently exploring the fault tolerance and recovery mechanism in
> batch mode within Apache Flink.
>
> If I terminate the task manager process while the job is running, the job
> restarts from the point of failure. However, at some point, the job
> restarts from the very beginning.
>
> The documentation mentions that checkpointing and the state backend do
> not work in batch mode.
>
> How does recovery after a failure occur in BATCH mode?
>
> According to the documentation: "In BATCH runtime mode, Flink will attempt
> to return to previous processing steps for which intermediate results are
> still available. Potentially, only those tasks that fail (or their
> predecessors in the graph) will have to be restarted."
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/execution_mode/
>
> I would appreciate any information regarding this matter.
>
> Kind regards,
>
> Vladimir