> We should check there's no running Flink job before deleting a session FlinkDeployment
If we have to prevent the session cluster from being stopped before all
session jobs are down, I think we should avoid deleting the session
deployment by returning DeleteControl#noFinalizerRemoval()[1] in cleanup,
and then schedule a reconcile that checks again and deletes the session
cluster only once there is no session job instance left.

[1]: https://github.com/java-operator-sdk/java-operator-sdk/blob/b91221bb54af19761a617bf18eef381e8ceb3b4c/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/Reconciler.java#L14

On Tue, Mar 22, 2022 at 18:48, Yang Wang <danrtsey...@gmail.com> wrote:

> The relationship between the session deployment and the Flink jobs looks
> good to me except for the session deployment deletion.
>
> I strongly suggest not setting the ownerReference of the FlinkSessionJob
> to the session FlinkDeployment. Otherwise, it will be a disaster if the
> session FlinkDeployment is deleted accidentally while many jobs are
> running. We should check there's no running Flink job before deleting a
> session FlinkDeployment. This will also force users to give a double
> confirmation.
>
> Best,
> Yang
>
> On Tue, Mar 22, 2022 at 17:49, Aitozi <gjying1...@gmail.com> wrote:
>
> > Hi Thomas:
> >
> > Thanks for your valuable question. Let's make the relationship between
> > the session deployment and the jobs clearer.
> >
> > IMO, the session deployment and jobs interact in these situations:
> >
> > - Create the session job. The FlinkSessionJobController will wait for
> >   the session cluster to be ready and then submit the job. The lookup
> >   key is namespace and clusterId.
> >
> > - Delete the session job. This cancels the current session job.
> >
> > - Delete the session deployment. The session jobs have to be deleted
> >   first; we could set the ownerReference of the FlinkSessionJob so that
> >   Kubernetes triggers the cleanup of the session jobs before removing
> >   the session deployment.
> >
> > - Upgrade the session deployment.
> >   It will be a critical part, because it will affect all the session
> >   jobs. We should suspend the jobs first and then upgrade the session
> >   cluster. So I tend to validate that all the jobs are suspended and
> >   then perform the session cluster upgrade. After the upgrade, the
> >   session jobs are changed back to running manually.
> >
> > What do you think about this? If there is no objection, I will clarify
> > it in the FLIP doc.
> >
> > Besides, sorry for the rough vote and discussion process. It's my first
> > time driving this, I will keep that in mind next time :)
> >
> > Best,
> > Aitozi.
> >
> > On Tue, Mar 22, 2022 at 10:11, Yang Wang <danrtsey...@gmail.com> wrote:
> >
> > > I think the session cluster should not be deleted unless all the
> > > running jobs have finished or been cancelled. I agree this should be
> > > clarified in the FLIP.
> > >
> > > Best,
> > > Yang
> > >
> > > On Tue, Mar 22, 2022 at 09:26, Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Hi Aitozi,
> > > >
> > > > Thanks for the proposal. Can you please clarify in the FLIP the
> > > > relationship between the session deployment and the jobs that
> > > > depend on it? Will, for example, the operator ensure that the
> > > > individual jobs are deleted when the underlying cluster is deleted?
> > > >
> > > > Side note: When the discussion thread started 5 days ago and a FLIP
> > > > vote was started 2 days later and there is also a weekend included,
> > > > then this is probably on the short side for broader feedback.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > > On Fri, Mar 18, 2022 at 4:01 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > > >
> > > > > Great work. Since we are introducing a new public API, it deserves
> > > > > a FLIP. And the FLIP will help later contributors catch up quickly.
> > > > >
> > > > > Best,
> > > > > Yang
> > > > >
> > > > > On Fri, Mar 18, 2022 at 18:11, Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Aitozi, a FLIP might be overkill at this point but no harm
> > > > > > in voting on it anyways :)
> > > > > >
> > > > > > Looks good!
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > On Fri, Mar 18, 2022 at 10:25 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Guys:
> > > > > > >
> > > > > > > FYI, I have integrated your comments and drafted FLIP-215[1]. I
> > > > > > > will create another thread to vote for it.
> > > > > > >
> > > > > > > [1]:
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-215%3A+Introduce+FlinkSessionJob+CRD+in+the+kubernetes+operator
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Aitozi.
> > > > > > >
> > > > > > > On Thu, Mar 17, 2022 at 11:16, Aitozi <gjying1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Biao Geng:
> > > > > > > >
> > > > > > > > Thanks for your feedback, I'm +1 to go with option#2. It's a
> > > > > > > > good point that we should improve error-message debugging for
> > > > > > > > the session job. I think it can be follow-up work, as an
> > > > > > > > improvement after we support the session job operation.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Aitozi.
> > > > > > > >
> > > > > > > > On Thu, Mar 17, 2022 at 10:55, Geng Biao <biaoge...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Thanks Aitozi for the work!
> > > > > > > >>
> > > > > > > >> I lean toward option#2 of using JarRunHeaders with an uber job
> > > > > > > >> jar as well.
> > > > > > > >> As Yang said, user-defined dependencies may be better supported
> > > > > > > >> in upstream Flink.
> > > > > > > >> A follow-up thought: I think we should consider the potential
> > > > > > > >> impact on user experience. As the job graph is generated in the
> > > > > > > >> JM, when the generation fails due to some issue in the main()
> > > > > > > >> method, we should do some work on showing such error messages,
> > > > > > > >> in this proposal or in the later k8s operator implementation.
> > > > > > > >> The reason for raising this is that if users submit many jobs to
> > > > > > > >> the same session cluster, it may not be easy for them to find
> > > > > > > >> the relevant error logs about the main() method of a specific
> > > > > > > >> job. FLINK-25715 could help us later.
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Biao Geng
> > > > > > > >>
> > > > > > > >> From: Aitozi <gjying1...@gmail.com>
> > > > > > > >> Date: Wednesday, March 16, 2022, 5:19 PM
> > > > > > > >> To: dev@flink.apache.org <dev@flink.apache.org>
> > > > > > > >> Subject: Re: [DISCUSS] Support the session job management in
> > > > > > > >> kubernetes operator
> > > > > > > >>
> > > > > > > >> Hi Yang Wang
> > > > > > > >>
> > > > > > > >> Thanks for your feedback. Providing the local and http
> > > > > > > >> implementations for the first version makes sense to me.
> > > > > > > >> +1 for it.
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Aitozi
> > > > > > > >>
> > > > > > > >> On Wed, Mar 16, 2022 at 16:44, Yang Wang <danrtsey...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> > # How to download the user jars
> > > > > > > >> > I agree with Gyula that it will be a burden if we bundle the
> > > > > > > >> > Flink filesystem dependencies in the operator image.
> > > > > > > >> > Maybe we could have an *ArtifactFetcher* interface in the
> > > > > > > >> > flink-kubernetes-operator. By default, we provide the local
> > > > > > > >> > and http implementations, which means we could get the user
> > > > > > > >> > jars from local files or HTTP URLs. Flink filesystem support
> > > > > > > >> > could be done as a follow-up based on the feedback.
> > > > > > > >> >
> > > > > > > >> > If the user wants to use the local implementation, they need
> > > > > > > >> > to mount a PV (persistent volume) to the operator first and
> > > > > > > >> > then put their jars into the PV.
> > > > > > > >> >
> > > > > > > >> > # How to talk to session JobManager to submit the job
> > > > > > > >> > After more consideration, I also prefer the second approach,
> > > > > > > >> > via the REST API /jars/:jarid/run. If we have strong
> > > > > > > >> > requirements to support dependency jars and artifacts, we
> > > > > > > >> > could try to support this in the upstream project.
> > > > > > > >> >
> > > > > > > >> > Best,
> > > > > > > >> > Yang
> > > > > > > >> >
> > > > > > > >> > On Wed, Mar 16, 2022 at 16:11, Aitozi <gjying1...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi Gyula
> > > > > > > >> > >
> > > > > > > >> > > Thanks for your quick response. Regarding the dependencies
> > > > > > > >> > > for different filesystems, I think we can make them optional
> > > > > > > >> > > and pluggable, and let users choose them when building their
> > > > > > > >> > > operator image. Users can build their image from the base
> > > > > > > >> > > operator image and add the filesystem dependency they want
> > > > > > > >> > > to use to it.
> > > > > > > >> > > BTW, we can support the http URI by default.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Aitozi.
> > > > > > > >> > >
> > > > > > > >> > > On Wed, Mar 16, 2022 at 15:53, Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Thank you Aitozi!
> > > > > > > >> > > >
> > > > > > > >> > > > I think this will be a very nice (and simple) addition to
> > > > > > > >> > > > enable these use-cases.
> > > > > > > >> > > >
> > > > > > > >> > > > I have 2 comments regarding the proposal:
> > > > > > > >> > > >
> > > > > > > >> > > > 1. I think if we want to support different filesystems to
> > > > > > > >> > > > download jars from, we probably need some clever ways to
> > > > > > > >> > > > add external operator dependencies (jars, configs).
> > > > > > > >> > > > I would prefer not to bundle them into the base operator
> > > > > > > >> > > > image.
> > > > > > > >> > > >
> > > > > > > >> > > > 2. I think we should avoid creating the jobgraphs on the
> > > > > > > >> > > > operator side and use the jar upload/run rest api instead
> > > > > > > >> > > > as you suggested. This will avoid flink version and
> > > > > > > >> > > > dependency conflicts.
> > > > > > > >> > > >
> > > > > > > >> > > > Cheers,
> > > > > > > >> > > > Gyula
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, Mar 16, 2022 at 8:41 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi Guys:
> > > > > > > >> > > > >
> > > > > > > >> > > > > I would like to open a discussion on supporting session
> > > > > > > >> > > > > job management in the kubernetes operator.
> > > > > > > >> > > > > It's intended to enhance the flink-kubernetes-operator
> > > > > > > >> > > > > to manage session jobs with k8s tooling. I have drafted
> > > > > > > >> > > > > the design doc[1]. Please refer to it and give me some
> > > > > > > >> > > > > feedback.
> > > > > > > >> > > > >
> > > > > > > >> > > > > [1]
> > > > > > > >> > > > > https://docs.google.com/document/d/1WPGbur1eT3H_5gN-kyXfp7EDjdbJUURx6jN8nt6UT-s/edit#
> > > > > > > >> > > > >
> > > > > > > >> > > > > Best,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Aitozi.
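
Coming back to the cleanup suggestion at the top of this thread, here is a rough sketch of the decision it describes: keep the finalizer on the session FlinkDeployment while any session job still exists, and remove it only when none is left. This is plain Java with no operator SDK dependency, and every name in it (SessionClusterCleanupSketch, CleanupDecision, decide) is hypothetical and made up for illustration. In a real reconciler the "keep" branch would presumably be where DeleteControl#noFinalizerRemoval() is returned; how the re-check gets scheduled is left open here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these names come from the operator code base.
class SessionClusterCleanupSketch {

    /** Outcome of one cleanup attempt for a session FlinkDeployment. */
    enum CleanupDecision {
        /** No session job is left, so the finalizer can go and the cluster is deleted. */
        REMOVE_FINALIZER,
        /** Session jobs remain: keep the finalizer and look again on a later reconcile. */
        KEEP_FINALIZER_AND_RECHECK
    }

    /**
     * Decides what cleanup should do, given the names of the session jobs
     * that still reference the session cluster being deleted.
     */
    static CleanupDecision decide(List<String> remainingSessionJobs) {
        return remainingSessionJobs.isEmpty()
                ? CleanupDecision.REMOVE_FINALIZER
                : CleanupDecision.KEEP_FINALIZER_AND_RECHECK;
    }

    public static void main(String[] args) {
        // Two jobs still exist, so deletion is held back.
        System.out.println(decide(Arrays.asList("job-a", "job-b")));
        // Nothing is left, so the session cluster may be removed.
        System.out.println(decide(Collections.<String>emptyList()));
    }
}
```

With this shape an accidental delete of the session deployment stays pending until the user has also removed the session jobs, which is the double confirmation Yang asked for.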