Thanks for drafting this FLIP, Matthias, Mika and David.

I like the proposed JobResultStore. Besides addressing the problem of
re-executing finished jobs, it's also an important step towards HA of
multi-job Flink applications.

I have one question. The "Cleanup" section shows that the JobMaster is
responsible for cleaning up the CheckpointCounter/CheckpointStore. Does this
mean Flink will have to re-create the JobMaster/Scheduler/ExecutionGraph for
a terminated job just to do the cleanup? If so, this can be heavy in certain
cases, because ExecutionGraph creation may trigger connector initialization.
So I'm wondering whether it would be possible to make the
CheckpointCounter/CheckpointStore a component of the Dispatcher?
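
A rough sketch of that idea, purely for illustration: every type and method
name below (CheckpointStateCleaner, InMemoryCheckpointState, SketchDispatcher)
is hypothetical and not Flink's actual API. It assumes the checkpoint state
can be discarded knowing only the job id, so the Dispatcher-level component
never has to rebuild a JobMaster or ExecutionGraph.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; none of these types are Flink's actual API.

/** Job-scoped checkpoint state that can be discarded knowing only the job id. */
interface CheckpointStateCleaner {
    void cleanup(String jobId);
}

/** Toy in-memory stand-in for a checkpoint counter plus completed-checkpoint store. */
final class InMemoryCheckpointState implements CheckpointStateCleaner {
    private final Map<String, Long> counters = new ConcurrentHashMap<>();
    private final Map<String, Set<Long>> completed = new ConcurrentHashMap<>();

    /** Increments the job's counter and records the new checkpoint id as completed. */
    void recordCheckpoint(String jobId) {
        long id = counters.merge(jobId, 1L, Long::sum);
        completed.computeIfAbsent(jobId, k -> ConcurrentHashMap.newKeySet()).add(id);
    }

    boolean hasState(String jobId) {
        return counters.containsKey(jobId) || completed.containsKey(jobId);
    }

    @Override
    public void cleanup(String jobId) {
        counters.remove(jobId);
        completed.remove(jobId);
    }
}

/** Toy dispatcher: cleans up terminated jobs without constructing a JobMaster. */
final class SketchDispatcher {
    private final CheckpointStateCleaner cleaner;
    private final Set<String> dirtyJobs = ConcurrentHashMap.newKeySet();

    SketchDispatcher(CheckpointStateCleaner cleaner) {
        this.cleaner = cleaner;
    }

    /** Marks a terminated job whose cleanup is still pending (e.g. found after failover). */
    void markDirty(String jobId) {
        dirtyJobs.add(jobId);
    }

    /** Cleans every pending job using only its id; returns the ids that were cleaned. */
    Set<String> cleanupDirtyJobs() {
        Set<String> cleaned = new HashSet<>(dirtyJobs);
        for (String jobId : cleaned) {
            cleaner.cleanup(jobId);
            dirtyJobs.remove(jobId);
        }
        return cleaned;
    }
}
```

In this sketch the cleanup path depends only on the job id and the cleaner
interface, so no scheduler or connector code would run for a job that has
already terminated. Whether Flink's real checkpoint components can be
decoupled this way is exactly the open question.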

Thanks,
Zhu

Till Rohrmann <trohrm...@apache.org> wrote on Sat, Nov 27, 2021 at 1:29 AM:

> Thanks for creating this FLIP Matthias, Mika and David.
>
> I think the JobResultStore is an important piece for fixing Flink's last
> high-availability problem (afaik). Once we have this piece in place, users
> no longer risk re-executing a successfully completed job.
>
> I have one comment concerning breaking interfaces:
>
> If we don't want to break interfaces, then we could keep the
> HighAvailabilityServices.getRunningJobsRegistry() method and add a default
> implementation for HighAvailabilityServices.getJobResultStore(). We could
> then deprecate the former method and then remove it in the subsequent
> release (1.16).
>
> Apart from that, +1 for the FLIP.
>
> Cheers,
> Till
>
> On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:
>
> > Hi everyone,
> >
> > Matthias, Mika and I want to start a discussion about the introduction of
> > a new Flink component, the *JobResultStore*.
> >
> > The main motivation is to address shortcomings of the
> > *RunningJobsRegistry* and supersede it with the new component. These
> > shortcomings were first described in FLINK-11813 [1].
> >
> > This change should improve the overall stability of the JobManager's
> > components and address the race conditions in some of the failover
> > scenarios during the job cleanup lifecycle.
> >
> > It should also help to ensure that Flink doesn't leave any uncleaned
> > resources behind.
> >
> > We've prepared FLIP-194 [2], which outlines the design and reasoning
> > behind this new component.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-11813
> > [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
> >
> > We're looking forward to your feedback ;)
> >
> > Best,
> > Matthias, Mika and David
> >
>
