With streaming execution, the entire pipeline must be running at all times so that results can be produced continuously. With batch execution, by contrast, the job graph can be segmented into separate pipelined stages that execute sequentially, each running to completion before the next begins. See [1] for more details on how the scheduler is organized.
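For reference, here is a minimal sketch of selecting BATCH execution mode on the DataStream API (the class name, pipeline, and job name are just illustrative; the mode can also be set at submission time with -Dexecution.runtime-mode=BATCH):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // With bounded sources, BATCH mode enables the staged scheduling
        // described above: stages run one after another, and each stage's
        // results are materialized for the next stage to read.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("stage", "by", "stage", "execution")
           .map(String::toUpperCase)
           .print();

        env.execute("batch-mode-sketch");
    }
}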
Each stage writes its results to disk, to be read as input by the subsequent stage. During recovery, the job could be restarted from the very beginning, but if the job has completed some intermediate stages, and the results of those stages are still available on disk, then those successfully completed stages do not need to be re-executed.

David

[1] https://flink.apache.org/2020/12/02/improvements-in-task-scheduling-for-batch-workloads-in-apache-flink-1.12/

On Fri, Feb 16, 2024 at 4:29 AM Вова Фролов <vovafrolov1...@gmail.com> wrote:

> Hi everyone,
>
> I am currently exploring the fault tolerance and recovery mechanism in
> batch mode within Apache Flink.
>
> If I terminate the task manager process while the job is running, the job
> restarts from the point of failure. However, at some point, the job
> restarts from the very beginning.
>
> The documentation mentions that checkpointing and the state backend do
> not work in batch mode.
>
> How does recovery after a failure occur in BATCH mode?
>
> According to the documentation: "In BATCH runtime mode, Flink will attempt
> to return to previous processing steps for which intermediate results are
> still available. Potentially, only those tasks that fail (or their
> predecessors in the graph) will have to be restarted."
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/execution_mode/
>
> I would appreciate any information regarding this matter.
>
> Kind regards,
>
> Vladimir