Hi Lijie,

Recovery for batch jobs is no doubt a long-awaited feature. Thanks for the proposal!
I'm concerned about the multi-job scenario. In session mode, users can use web submission to upload and run jars that may produce multiple Flink jobs. However, these jobs may not all be submitted at once and run in parallel; instead, they may depend on other jobs, forming a DAG, and the scheduling of the jobs is controlled by the user's main method. IIUC, in the FLIP, the main method is lost after recovery, and only already-submitted jobs would be recovered. Is that right?

Best,
Paul Lam

> On Nov 2, 2023, at 18:00, Lijie Wang <wangdachui9...@gmail.com> wrote:
>
> Hi devs,
>
> Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job
> Recovery for Batch Jobs [1].
>
> Currently, when Flink's job manager crashes or gets killed, possibly due to
> unexpected errors or planned node decommission, one of the following two
> situations occurs:
> 1. The job fails, if it does not have HA enabled.
> 2. The job restarts, if it has HA enabled. A streaming job will be resumed
> from the last successful checkpoint, but a batch job has to run from the
> beginning, and all previous progress is lost.
>
> Given this, we think a JM crash can cause a great regression for batch
> jobs, especially long-running ones. This FLIP aims to solve this problem so
> that batch jobs can recover most of their progress after a JM crash. Our
> goal in this FLIP is that most finished tasks will not need to be re-run.
>
> You can find more details in FLIP-383 [1]. Looking forward to your
> feedback.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
>
> Best,
> Lijie
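The multi-job pattern Paul describes can be sketched as follows. This is a minimal pure-Python illustration, not Flink's API: `submit_job` and the driver `main` are hypothetical stand-ins for a user program that submits dependent jobs one after another.

```python
# Hypothetical sketch of the multi-job scenario: a user's main method
# submits several jobs whose scheduling forms a DAG. The names here
# (submit_job, main) are illustrative, not real Flink APIs.

def submit_job(name, inputs):
    """Stand-in for submitting one Flink job and blocking on its result."""
    print(f"running {name} on inputs {inputs}")
    return f"{name}-output"

def main():
    # The driver's control flow decides which job runs next, based on
    # the results of earlier jobs. This logic lives only in the user's
    # main method, not in any submitted job.
    a = submit_job("job-a", [])
    b = submit_job("job-b", [a])       # depends on job-a
    c = submit_job("job-c", [a, b])    # depends on job-a and job-b
    return [a, b, c]

# If the JobManager crashes after job-a finishes, the FLIP can recover
# job-a (an already-submitted job), but the driver process above and its
# pending decisions to submit job-b and job-c are not part of that state.
```

The point of the sketch is that the DAG lives in the driver's control flow, so recovering only submitted jobs leaves the not-yet-submitted downstream jobs unaccounted for.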