Thanks for the fruitful discussion. I also hope that we can remove all the pointers from the HA store (ZK, ConfigMap) in the future. After that, we would rely on ZK/ConfigMap only for leader election/retrieval.

Best,
Yang

David Morávek <d...@apache.org> wrote on Mon, Dec 6, 2021 at 4:57 PM:

> As all of the concerns seem to be addressed, I'd like to proceed with the
> vote to move things forward.
>
> Thanks everyone for the feedback, it was really helpful!
>
> Best,
> D.
>
> On Wed, Dec 1, 2021 at 6:39 AM Zhu Zhu <reed...@gmail.com> wrote:
>
> > Thanks for the explanation Matthias. The solution sounds good to me.
> > I have no more concerns and +1 for the FLIP.
> >
> > Thanks,
> > Zhu
> >
> > Xintong Song <tonysong...@gmail.com> wrote on Wed, Dec 1, 2021 at 12:56 PM:
> >
> > > @David,
> > >
> > > Thanks for the clarification.
> > >
> > > No more concerns from my side. +1 for this FLIP.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > > On Wed, Dec 1, 2021 at 12:28 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > >
> > > > Given the other breaking changes, I think that it is ok to remove the
> > > > `RunningJobsRegistry` completely.
> > > >
> > > > Since we allow users to specify a HighAvailabilityServices
> > > > implementation when starting Flink via `high-availability: FQDN`,
> > > > I think we should mark the interface at least @Experimental.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Nov 30, 2021 at 2:29 PM Mika Naylor <m...@autophagy.io> wrote:
> > > >
> > > > > Hi Till,
> > > > >
> > > > > We thought that breaking interfaces, specifically
> > > > > HighAvailabilityServices and RunningJobsRegistry, was acceptable in
> > > > > this instance because:
> > > > >
> > > > > - Neither of these interfaces is marked @Public, so they carry no
> > > > >   guarantees about being public and stable.
> > > > > - As far as we are aware, we currently have no users with custom
> > > > >   HighAvailabilityServices implementations.
> > > > > - The interface was already broken in 1.14 with the changes to
> > > > >   CheckpointRecoveryFactory, and will likely be changed again in
> > > > >   1.15 due to further changes in that factory.
> > > > >
> > > > > Given that, we thought changes to the interface would not be
> > > > > disruptive. Perhaps it could be annotated as @Internal - I'm not
> > > > > sure exactly what guarantees we try to give for the stability of
> > > > > the HighAvailabilityServices interface.
> > > > >
> > > > > Kind regards,
> > > > > Mika
> > > > >
> > > > > On 26.11.2021 18:28, Till Rohrmann wrote:
> > > > > >Thanks for creating this FLIP Matthias, Mika and David.
> > > > > >
> > > > > >I think the JobResultStore is an important piece for fixing Flink's
> > > > > >last high-availability problem (afaik). Once we have this piece in
> > > > > >place, users no longer risk re-executing a successfully completed
> > > > > >job.
> > > > > >
> > > > > >I have one comment concerning breaking interfaces:
> > > > > >
> > > > > >If we don't want to break interfaces, then we could keep the
> > > > > >HighAvailabilityServices.getRunningJobsRegistry() method and add a
> > > > > >default implementation for
> > > > > >HighAvailabilityServices.getJobResultStore(). We could then
> > > > > >deprecate the former method and remove it in the subsequent
> > > > > >release (1.16).
> > > > > >
> > > > > >Apart from that, +1 for the FLIP.
> > > > > >
> > > > > >Cheers,
> > > > > >Till
> > > > > >
> > > > > >On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:
> > > > > >
> > > > > >> Hi everyone,
> > > > > >>
> > > > > >> Matthias, Mika and I want to start a discussion about the
> > > > > >> introduction of a new Flink component, the *JobResultStore*.
> > > > > >>
> > > > > >> The main motivation is to address shortcomings of the
> > > > > >> *RunningJobsRegistry* and supersede it with the new component.
> > > > > >> These shortcomings were first described in FLINK-11813 [1].
> > > > > >>
> > > > > >> This change should improve the overall stability of the
> > > > > >> JobManager's components and address the race conditions in some
> > > > > >> of the failover scenarios during the job cleanup lifecycle.
> > > > > >>
> > > > > >> It should also help to ensure that Flink doesn't leave any
> > > > > >> uncleaned resources behind.
> > > > > >>
> > > > > >> We've prepared FLIP-194 [2], which outlines the design and
> > > > > >> reasoning behind this new component.
> > > > > >>
> > > > > >> [1] https://issues.apache.org/jira/browse/FLINK-11813
> > > > > >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
> > > > > >>
> > > > > >> We're looking forward to your feedback ;)
> > > > > >>
> > > > > >> Best,
> > > > > >> Matthias, Mika and David
> > > > > >>
> > > > >
> > > > > Mika Naylor
> > > > > https://autophagy.io