[DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

Lijie Wang Thu, 02 Nov 2023 03:01:16 -0700

Hi devs,

Zhu Zhu and I would like to start a discussion about FLIP-383: Support Job
Recovery for Batch Jobs[1]


Currently, when Flink’s job manager crashes or gets killed, possibly due to
unexpected errors or planned nodes decommission, it will cause the
following two situations:
1. Failed, if the job does not enable HA.
2. Restart, if the job enable HA. If it’s a streaming job, the job will be
resumed from the last successful checkpoint. If it’s a batch job, it has to
run from beginning, all previous progress will be lost.

In view of this, we think the JM crash may cause great regression for batch
jobs, especially long running batch jobs. This FLIP is mainly to solve this
problem so that batch jobs can recover most job progress after JM crashes.
In this FLIP, our goal is to let most finished tasks not need to be re-run.

You can find more details in the FLIP-383[1]. Looking forward to your
feedback.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs

Best,
Lijie

[DISCUSS] FLIP-383: Support Job Recovery for Batch Jobs

Reply via email to