Hi devs,

After an offline discussion with the Apache Celeborn folks, we changed the
signatures of "snapshotState" and "restoreState" as follows:

void snapshotState(CompletableFuture<ShuffleMasterSnapshot> snapshotFuture, ShuffleMasterSnapshotContext context);

void restoreState(List<ShuffleMasterSnapshot> snapshots);

We believe the above signatures are more general and flexible:
1. ShuffleMasterSnapshotContext can provide the necessary information for
taking snapshots.
2. It abstracts the ShuffleMasterSnapshot interface and supports two kinds of
snapshots: full and incremental. We can use incremental snapshots when the
internal state is too large, to make snapshot operations faster and occupy
less storage.

See the "JobEvent" section of the FLIP for details.
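To make the proposed contract a bit more concrete, below is a rough,
illustrative sketch in plain Java. It is not the actual Flink/Celeborn code:
the class SketchShuffleMaster, the Map<Integer, String> partition registry,
and the isIncremental()/capturedPartitions() methods are placeholders invented
for this example, and the real interfaces described in the FLIP may differ.

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Hypothetical stand-ins for the interfaces discussed above.
interface ShuffleMasterSnapshot {
    /** Whether this snapshot contains the full state or only a delta. */
    boolean isIncremental();

    Map<Integer, String> capturedPartitions();
}

interface ShuffleMasterSnapshotContext {
    // E.g. where the snapshot should be written, checkpoint/job IDs, etc.
}

/** Sketch of a shuffle master supporting the proposed snapshot/restore hooks. */
class SketchShuffleMaster {

    private final Map<Integer, String> partitionRegistry = new ConcurrentHashMap<>();
    private final Executor ioExecutor;

    SketchShuffleMaster(Executor ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    /** Asynchronously captures the internal state and completes the given future. */
    public void snapshotState(
            CompletableFuture<ShuffleMasterSnapshot> snapshotFuture,
            ShuffleMasterSnapshotContext context) {
        // Take an immutable copy on the caller's thread, then finish the
        // (possibly slow) persistence work asynchronously so this call does not block.
        Map<Integer, String> copy = Map.copyOf(partitionRegistry);
        ioExecutor.execute(() -> snapshotFuture.complete(new ShuffleMasterSnapshot() {
            @Override
            public boolean isIncremental() {
                return false; // this sketch always produces a full snapshot
            }

            @Override
            public Map<Integer, String> capturedPartitions() {
                return copy;
            }
        }));
    }

    /** Rebuilds the internal state from a full snapshot plus any incremental ones. */
    public void restoreState(List<ShuffleMasterSnapshot> snapshots) {
        partitionRegistry.clear();
        // Snapshots are assumed to be ordered: a full snapshot first, then increments.
        for (ShuffleMasterSnapshot snapshot : snapshots) {
            partitionRegistry.putAll(snapshot.capturedPartitions());
        }
    }
}

The CompletableFuture-based signature lets an implementation finish the slow
persistence work off the calling thread, and having restoreState accept a
List means a full snapshot followed by later incremental ones can be replayed
in order, which is presumably how the incremental mode in point 2 above would
be consumed.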
By the way, if there are no more questions, we will start voting tomorrow.

Best,
Lijie

On Tue, Dec 5, 2023 at 10:57 PM, Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi Paul,
>
> I believe Xintong has answered your question.
>
> >> IIUC, in the FLIP, the main method is lost after the recovery, and only
> >> submitted jobs would be recovered. Is that right?
>
> You are right, we can't recover the execution progress of the main method.
> So after the JM crashes, only the submitted and incomplete jobs (as Xintong
> said, completed jobs will not be re-run) will be recovered and continue to
> run.
>
> Best,
> Lijie
>
> On Tue, Dec 5, 2023 at 6:30 PM, Xintong Song <tonysong...@gmail.com> wrote:
>
>> @Paul,
>>
>> Do you mean the scenario where users call `env.execute()` multiple times
>> in the `main()` method? I believe that is not supported currently when HA
>> is enabled, for the exact same reason you mentioned: Flink is not aware of
>> which jobs have been executed and which have not.
>>
>> On the other hand, if an external scheduler is used to submit multiple
>> jobs to a session cluster, Flink already has a JobResultStore for
>> persisting information about successfully completed jobs, so that only
>> incomplete jobs will be recovered. See FLIP-194[1] for more details.
>>
>> Best,
>>
>> Xintong
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
>>
>> On Tue, Dec 5, 2023 at 6:01 PM Xintong Song <tonysong...@gmail.com>
>> wrote:
>>
>> > Thanks for addressing my comments, Lijie. LGTM
>> >
>> > Best,
>> >
>> > Xintong
>> >
>> > On Tue, Dec 5, 2023 at 2:56 PM Paul Lam <paullin3...@gmail.com> wrote:
>> >
>> >> Hi Lijie,
>> >>
>> >> Recovery for batch jobs is no doubt a long-awaited feature. Thanks for
>> >> the proposal!
>> >>
>> >> I'm concerned about the multi-job scenario. In session mode, users
>> >> could use web submission to upload and run jars which may produce
>> >> multiple Flink jobs. However, these jobs may not be submitted at once
>> >> and run in parallel. Instead, they could depend on other jobs like a
>> >> DAG. The scheduling of the jobs is controlled by the user's main
>> >> method.
>> >>
>> >> IIUC, in the FLIP, the main method is lost after the recovery, and
>> >> only submitted jobs would be recovered. Is that right?
>> >>
>> >> Best,
>> >> Paul Lam
>> >>
>> >> > On Nov 2, 2023, at 18:00, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >> >
>> >> > Hi devs,
>> >> >
>> >> > Zhu Zhu and I would like to start a discussion about FLIP-383:
>> >> > Support Job Recovery for Batch Jobs[1].
>> >> >
>> >> > Currently, when Flink's job manager crashes or gets killed, possibly
>> >> > due to unexpected errors or planned node decommission, it will cause
>> >> > one of the following two situations:
>> >> > 1. Failed, if the job does not enable HA.
>> >> > 2. Restart, if the job enables HA. If it's a streaming job, the job
>> >> > will be resumed from the last successful checkpoint. If it's a batch
>> >> > job, it has to run from the beginning, and all previous progress will
>> >> > be lost.
>> >> >
>> >> > In view of this, we think a JM crash may cause a great regression for
>> >> > batch jobs, especially long-running batch jobs. This FLIP is mainly
>> >> > to solve this problem so that batch jobs can recover most of their
>> >> > progress after the JM crashes. In this FLIP, our goal is to let most
>> >> > finished tasks not need to be re-run.
>> >> >
>> >> > You can find more details in FLIP-383[1]. Looking forward to your
>> >> > feedback.
>> >> >
>> >> > [1]
>> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
>> >> >
>> >> > Best,
>> >> > Lijie