Hi Guowei,

Thanks for your feedback.

>> As far as I know, there are multiple job managers on standby in some
scenarios. In this case, is your design still effective?
I think it's still effective. There will only be one leader at a time.
After a standby JobManager becomes the leader, the startup process of the
JobMaster is the same as when a single JobManager restarts, so the current
design should also apply to the multi-JobManager setup. We will also add
some tests to cover this case.

>> How do you rule out that there might still be some states in the memory
of the original operator coordinator?
The current restore process is the same one streaming jobs use to restore
from a checkpoint after failover (it calls the same methods), which is
widely used in production, so I think there is no problem here. In
addition, the new JobMaster creates a fresh operator coordinator instance
and resets it purely from the persisted snapshot, so no in-memory state of
the original coordinator is carried over.
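
For illustration, a minimal sketch of that restore path as we understand
it (the OperatorCoordinator calls are Flink's runtime API; the wrapper
method and how the snapshot bytes are obtained are simplified and
hypothetical):

  import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;

  // Hypothetical helper, only to illustrate the point: the coordinator the
  // new JobMaster works with is a freshly created instance, and everything
  // it knows comes from the persisted snapshot, not from the old process's
  // memory.
  final class CoordinatorRestoreSketch {

      static OperatorCoordinator restore(
              OperatorCoordinator.Provider provider,
              OperatorCoordinator.Context context,
              long restoredCheckpointId,
              byte[] persistedSnapshot) throws Exception {

          // Brand-new coordinator instance, no state carried over.
          OperatorCoordinator coordinator = provider.create(context);

          // Same call streaming jobs use when restoring from a checkpoint
          // after failover.
          coordinator.resetToCheckpoint(restoredCheckpointId, persistedSnapshot);

          coordinator.start();
          return coordinator;
      }
  }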

>> Additionally, using NO_CHECKPOINT seems a bit odd. Why not use a normal
checkpoint ID greater than 0 and record it in the event store?
We use -1 (NO_CHECKPOINT) to distinguish it from a normal checkpoint; the
value -1 indicates that the snapshot is taken for the no-checkpoint/batch
scenario.
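
As a small, hypothetical caller-side sketch of what this looks like
(assuming a NO_CHECKPOINT constant with value -1 as described above, and
Flink's OperatorCoordinator#checkpointCoordinator API):

  import java.util.concurrent.CompletableFuture;

  import org.apache.flink.runtime.operators.coordination.OperatorCoordinator;

  // Hypothetical helper: when persisting coordinator state for batch job
  // recovery there is no real checkpoint, so the -1 sentinel is passed
  // instead of a positive checkpoint id, letting implementations tell the
  // two cases apart.
  final class BatchSnapshotSketch {

      static final long NO_CHECKPOINT = -1L;

      static CompletableFuture<byte[]> snapshotForBatchRecovery(
              OperatorCoordinator coordinator) {
          CompletableFuture<byte[]> result = new CompletableFuture<>();
          try {
              coordinator.checkpointCoordinator(NO_CHECKPOINT, result);
          } catch (Exception e) {
              result.completeExceptionally(e);
          }
          return result;
      }
  }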

Besides, considering that some operator coordinators currently may not
support taking snapshots in the no-checkpoint/batch scenario (or don't
support passing -1 as a checkpoint id), we think it is better to let
developers explicitly declare whether their implementation supports
snapshots in the batch scenario. Therefore, we intend to introduce the
"SupportsBatchSnapshot" interface for the split enumerator and the
"supportsBatchSnapshot" method for the operator coordinator. You can find
more details in the FLIP's "Introduce SupportsBatchSnapshot interface" and
"JobEvent" sections.
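
To make this concrete, here is a rough sketch of the two opt-in hooks. The
exact package, naming and defaults follow the FLIP; the standalone
interface carrying supportsBatchSnapshot() below is only a stand-in for
illustration (in the proposal the method belongs to the operator
coordinator):

  // (1) Marker interface a SplitEnumerator implements to declare that its
  //     snapshot can be taken in the no-checkpoint/batch scenario.
  interface SupportsBatchSnapshot {}

  // (2) The opt-in hook on the operator coordinator side; the default shown
  //     here is illustrative and keeps existing coordinators on their
  //     current behavior unless they explicitly opt in.
  interface BatchSnapshotAware {
      default boolean supportsBatchSnapshot() {
          return false;
      }
  }

Keeping this an explicit opt-in avoids silently passing -1 to
implementations that were never written to handle it.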

Looking forward to your further feedback.

Best,
Lijie

Guowei Ma <guowei....@gmail.com> wrote on Sun, Nov 19, 2023, 10:47:

> Hi,
>
>
> This is a very good proposal, as far as I know, it can solve some very
> critical production operations in certain scenarios. I have two minor
> issues:
>
> As far as I know, there are multiple job managers on standby in some
> scenarios. In this case, is your design still effective? I'm unsure if you
> have conducted any tests. For instance, standby job managers might take
> over these failed jobs more quickly.
> Regarding the part about the operator coordinator, how can you ensure that
> the checkpoint mechanism can restore the state of the operator coordinator:
> For example:
> How do you rule out that there might still be some states in the memory of
> the original operator coordinator? After all, the implementation was done
> under the assumption of scenarios where the job manager doesn't fail.
> Additionally, using NO_CHECKPOINT seems a bit odd. Why not use a normal
> checkpoint ID greater than 0 and record it in the event store?
> If the issues raised in point 2 cannot be resolved in the short term, would
> it be possible to consider not supporting failover with a source job
> manager?
>
> Best,
> Guowei
>
>
> On Thu, Nov 2, 2023 at 6:01 PM Lijie Wang <wangdachui9...@gmail.com>
> wrote:
>
> > Hi devs,
> >
> > Zhu Zhu and I would like to start a discussion about FLIP-383: Support
> Job
> > Recovery for Batch Jobs[1]
> >
> > Currently, when Flink’s job manager crashes or gets killed, possibly due
> to
> > unexpected errors or planned nodes decommission, it will cause the
> > following two situations:
> > 1. Failed, if the job does not enable HA.
> > 2. Restart, if the job enable HA. If it’s a streaming job, the job will
> be
> > resumed from the last successful checkpoint. If it’s a batch job, it has
> to
> > run from beginning, all previous progress will be lost.
> >
> > In view of this, we think the JM crash may cause great regression for
> batch
> > jobs, especially long running batch jobs. This FLIP is mainly to solve
> this
> > problem so that batch jobs can recover most job progress after JM
> crashes.
> > In this FLIP, our goal is to let most finished tasks not need to be
> re-run.
> >
> > You can find more details in the FLIP-383[1]. Looking forward to your
> > feedback.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-383%3A+Support+Job+Recovery+for+Batch+Jobs
> >
> > Best,
> > Lijie
> >
>
