Re: [DISCUSS] FLIP-194: Introduce the JobResultStore

Mika Naylor Tue, 30 Nov 2021 05:28:53 -0800

Hi Till,

We thought that breaking interfaces, specifically
HighAvailabilityServices and RunningJobsRegistry, was acceptable in this
instance because:


- Neither of these interfaces is marked @Public, so neither carries any
  guarantee of being public and stable.
- As far as we are aware, we currently have no users with custom
  HighAvailabilityServices implementations.
- The interface was already broken in 1.14 with the changes to
  CheckpointRecoveryFactory, and will likely be changed again in 1.15
  due to further changes in that factory.

Given that, we thought changes to the interface would not be disruptive.
Perhaps it could be annotated as @Internal - I'm not sure exactly what
guarantees we try to give for the stability of the
HighAvailabilityServices interface.

Kind regards,
Mika

On 26.11.2021 18:28, Till Rohrmann wrote:

Thanks for creating this FLIP Matthias, Mika and David.

I think the JobResultStore is an important piece for fixing Flink's last
high-availability problem (afaik). Once we have this piece in place, users
no longer risk re-executing a successfully completed job.

I have one comment concerning breaking interfaces:

If we don't want to break interfaces, then we could keep the
HighAvailabilityServices.getRunningJobsRegistry() method and add a default
implementation for HighAvailabilityServices.getJobResultStore(). We could
then deprecate the former method and then remove it in the subsequent
release (1.16).
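
A minimal sketch of this deprecate-then-remove approach, under stated
assumptions: the interfaces below are simplified, hypothetical stand-ins and
not the real Flink signatures. The method names on RunningJobsRegistry and
JobResultStore, the String job IDs, and the LegacyHaServices class are all
invented for illustration. The thread does not say what the default
implementation should do; delegating to the old registry is only one
possibility.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for the old component.
interface RunningJobsRegistry {
    void setJobFinished(String jobId);
    boolean isJobFinished(String jobId);
}

// Hypothetical, simplified stand-in for the new component.
interface JobResultStore {
    void markAsFinished(String jobId);
    boolean hasJobResult(String jobId);
}

interface HighAvailabilityServices {
    /** Old accessor: kept for one release so existing implementations still compile. */
    @Deprecated
    RunningJobsRegistry getRunningJobsRegistry();

    /**
     * New accessor with a default body, so adding it does not break
     * existing implementations. Assumption: the default adapts the old registry.
     */
    default JobResultStore getJobResultStore() {
        RunningJobsRegistry registry = getRunningJobsRegistry();
        return new JobResultStore() {
            @Override
            public void markAsFinished(String jobId) {
                registry.setJobFinished(jobId);
            }

            @Override
            public boolean hasJobResult(String jobId) {
                return registry.isJobFinished(jobId);
            }
        };
    }
}

// A pre-existing custom implementation that only knows the old method.
class LegacyHaServices implements HighAvailabilityServices {
    private final Set<String> finished = new HashSet<>();

    private final RunningJobsRegistry registry = new RunningJobsRegistry() {
        @Override
        public void setJobFinished(String jobId) {
            finished.add(jobId);
        }

        @Override
        public boolean isJobFinished(String jobId) {
            return finished.contains(jobId);
        }
    };

    @Override
    public RunningJobsRegistry getRunningJobsRegistry() {
        return registry;
    }
}
```

In this sketch LegacyHaServices compiles unchanged and still returns a usable
JobResultStore through the default method. Once the deprecated accessor is
removed, the default body would have to go with it and implementations would
need to provide getJobResultStore() themselves.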

Apart from that, +1 for the FLIP.

Cheers,
Till

On Wed, Nov 17, 2021 at 6:05 PM David Morávek <[email protected]> wrote:

Hi everyone,

Matthias, Mika and I want to start a discussion about the introduction of a
new Flink component, the *JobResultStore*.

The main motivation is to address shortcomings of the *RunningJobsRegistry*
and supersede it with the new component. These shortcomings were first
described in FLINK-11813 [1].

This change should improve the overall stability of the JobManager's
components and address the race conditions in some of the failover
scenarios during the job cleanup lifecycle.

It should also help to ensure that Flink doesn't leave any uncleaned
resources behind.

We've prepared FLIP-194 [2], which outlines the design and reasoning
behind this new component.

[1] https://issues.apache.org/jira/browse/FLINK-11813
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435

We're looking forward to your feedback ;)

Best,
Matthias, Mika and David


Mika Naylor
https://autophagy.io
