Re: [DISCUSS] Support the session job management in kubernetes operator

Aitozi Tue, 22 Mar 2022 02:49:13 -0700

Hi Thomas:

    Thanks for your valuable question. Let me make the relationship between
the session deployment and the jobs clearer.


IMO, the session deployment and jobs interact in these situations:

- Create the session job. The FlinkSessionJobController waits for the
session cluster to be ready and then submits the job. The lookup key is the
namespace and clusterId.

- Delete the session job. This cancels the running session job.

- Delete the session deployment. The session jobs have to be deleted
first. We could set the ownerReference of the FlinkSessionJob so that
Kubernetes triggers the cleanup of the session jobs before removing the
session deployment.

- Upgrade the session deployment. This is the critical part, because it
affects all the session jobs. The jobs should be suspended first, and only
then should the session cluster be upgraded. So I tend to validate that all
the jobs are suspended before performing the session cluster upgrade. After
the upgrade, the session jobs are changed back to running manually.
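
To illustrate the upgrade check in the last item, here is a minimal sketch
in Python (the operator itself is written in Java). The SessionJob class,
its state values and the function names are hypothetical and exist only
for illustration; they are not the FlinkSessionJob CRD or any operator API.
The sketch only shows the idea of selecting jobs by (namespace, clusterId)
and refusing the upgrade while any of them is not suspended.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SessionJob:
    """Hypothetical, simplified view of a session job resource."""
    name: str
    namespace: str
    cluster_id: str  # name of the session deployment the job targets
    state: str       # assumed values: "running" or "suspended"


def jobs_for_cluster(jobs: Iterable[SessionJob], namespace: str,
                     cluster_id: str) -> List[SessionJob]:
    """Select the jobs bound to one session cluster.

    The lookup key is the pair (namespace, clusterId).
    """
    return [j for j in jobs
            if (j.namespace, j.cluster_id) == (namespace, cluster_id)]


def blocking_jobs(jobs: Iterable[SessionJob], namespace: str,
                  cluster_id: str) -> List[str]:
    """Names of jobs that are not suspended and so block an upgrade."""
    return [j.name for j in jobs_for_cluster(jobs, namespace, cluster_id)
            if j.state != "suspended"]


def can_upgrade(jobs: Iterable[SessionJob], namespace: str,
                cluster_id: str) -> bool:
    """The session cluster may be upgraded only if nothing blocks it."""
    return not blocking_jobs(jobs, namespace, cluster_id)
```

A validator could call can_upgrade() before applying a changed session
deployment spec and report the names from blocking_jobs() in the error
message, so the user knows which jobs still have to be suspended.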

What do you think about this? If there is no objection, I will clarify it
in the FLIP doc.


Besides, sorry for the rough vote and discussion process. It's my first
time driving this, and I will keep that in mind next time :)

Best,
Aitozi.

On Tue, Mar 22, 2022 at 10:11 Yang Wang <[email protected]> wrote:

> I think the session cluster should not be deleted unless all the running
> jobs have finished or been cancelled. I agree this should be clarified in
> the FLIP.
>
> Best,
> Yang
>
> On Tue, Mar 22, 2022 at 09:26 Thomas Weise <[email protected]> wrote:
>
> > Hi Aitozi,
> >
> > Thanks for the proposal. Can you please clarify in the FLIP the
> > relationship between the session deployment and the jobs that depend on
> > it? Will, for example, the operator ensure that the individual jobs are
> > deleted when the underlying cluster is deleted?
> >
> > Side note: the discussion thread started 5 days ago and the FLIP vote
> > was started 2 days later, with a weekend in between. That is probably
> > on the short side for broader feedback.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Fri, Mar 18, 2022 at 4:01 AM Yang Wang <[email protected]> wrote:
> >
> > > Great work. Since we are introducing a new public API, it deserves a
> > > FLIP. The FLIP will also help later contributors catch up quickly.
> > >
> > > Best,
> > > Yang
> > >
> > > On Fri, Mar 18, 2022 at 18:11 Gyula Fóra <[email protected]> wrote:
> > >
> > > > Thanks Aitozi, a FLIP might be overkill at this point, but there is
> > > > no harm in voting on it anyway :)
> > > >
> > > > Looks good!
> > > >
> > > > Gyula
> > > >
> > > > On Fri, Mar 18, 2022 at 10:25 AM Aitozi <[email protected]> wrote:
> > > >
> > > > > Hi Guys:
> > > > >
> > > > >     FYI, I have integrated your comments and drafted FLIP-215[1].
> > > > > I will create another thread to vote on it.
> > > > >
> > > > > [1]:
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-215%3A+Introduce+FlinkSessionJob+CRD+in+the+kubernetes+operator
> > > > >
> > > > > Best,
> > > > >
> > > > > Aitozi.
> > > > >
> > > > >
> > > > > On Thu, Mar 17, 2022 at 11:16 Aitozi <[email protected]> wrote:
> > > > >
> > > > > > Hi Biao Geng:
> > > > > >
> > > > > >    Thanks for your feedback, I'm +1 to go with option#2. It's a
> > > > > > good point that we should make error messages for session jobs
> > > > > > easier to debug. I think that can be follow-up work after we
> > > > > > support the session job operation.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Aitozi.
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 17, 2022 at 10:55 Geng Biao <[email protected]> wrote:
> > > > > >
> > > > > >> Thanks Aitozi for the work!
> > > > > >>
> > > > > >> I lean towards option#2 of using JarRunHeaders with an uber job
> > > > > >> jar as well. As Yang said, user-defined dependencies may be
> > > > > >> better supported in upstream Flink.
> > > > > >> A follow-up thought: I think we should consider the potential
> > > > > >> impact on user experience. As the job graph is generated in the
> > > > > >> JM, when the generation fails due to some issue in the main()
> > > > > >> method, we should do some work on showing such error messages,
> > > > > >> either in this proposal or in the later k8s operator
> > > > > >> implementation. The reason is that if users submit many jobs to
> > > > > >> the same session cluster, it may not be easy for them to find
> > > > > >> the relevant error logs for the main() method of a specific
> > > > > >> job. FLINK-25715 could help us later.
> > > > > >>
> > > > > >>
> > > > > >> Best,
> > > > > >> Biao Geng
> > > > > >>
> > > > > >>
> > > > > >> From: Aitozi <[email protected]>
> > > > > >> Date: Wednesday, March 16, 2022, 5:19 PM
> > > > > >> To: [email protected] <[email protected]>
> > > > > >> Subject: Re: [DISCUSS] Support the session job management in
> > > > > >> kubernetes operator
> > > > > >> Hi Yang Wang
> > > > > >>     Thanks for your feedback. Providing the local and HTTP
> > > > > >> implementations for the first version makes sense to me.
> > > > > >> +1 for it.
> > > > > >>
> > > > > >> Best,
> > > > > >> Aitozi
> > > > > >>
> > > > > >> On Wed, Mar 16, 2022 at 16:44 Yang Wang <[email protected]> wrote:
> > > > > >>
> > > > > >> > # How to download the user jars
> > > > > >> > I agree with Gyula that it will be a burden if we bundle the
> > > > > >> > Flink filesystem dependencies in the operator image.
> > > > > >> > Maybe we could have an *ArtifactFetcher* interface in the
> > > > > >> > flink-kubernetes-operator. By default, we provide the local
> > > > > >> > and HTTP implementations, which means we could get the user
> > > > > >> > jars from local files or HTTP URLs. Flink filesystem support
> > > > > >> > could be done as a follow-up based on the feedback.
> > > > > >> >
> > > > > >> > If the user wants to use the local implementation, they need
> > > > > >> > to mount a PV (persistent volume) to the operator first and
> > > > > >> > then put their jars into the PV.
> > > > > >> >
> > > > > >> > # How to talk to session JobManager to submit the job
> > > > > >> > After more consideration, I also prefer the second approach,
> > > > > >> > via the REST API /jars/:jarid/run. If we have strong
> > > > > >> > requirements to support dependency jars and artifacts, we
> > > > > >> > could try to support this in the upstream project.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Yang
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, Mar 16, 2022 at 16:11 Aitozi <[email protected]> wrote:
> > > > > >> >
> > > > > >> > > Hi Gyula
> > > > > >> > >     Thanks for your quick response. Regarding the dependency
> > > > > >> > > on different filesystems, I think we can make it optional
> > > > > >> > > and pluggable, and let users choose when building their
> > > > > >> > > operator image. Users can build their image from the base
> > > > > >> > > operator image and add the filesystem dependencies they want
> > > > > >> > > to use. BTW, we can support HTTP URIs by default.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Aitozi.
> > > > > >> > >
> > > > > >> > > On Wed, Mar 16, 2022 at 15:53 Gyula Fóra <[email protected]> wrote:
> > > > > >> > >
> > > > > >> > > > Thank you Aitozi!
> > > > > >> > > >
> > > > > >> > > > I think this will be a very nice (and simple) addition to
> > > > > >> > > > enable these use-cases.
> > > > > >> > > >
> > > > > >> > > > I have 2 comments regarding the proposal:
> > > > > >> > > >
> > > > > >> > > > 1. I think if we want to support different filesystems to
> > > > > >> > > > download jars from, we probably need some clever ways to
> > > > > >> > > > add external operator dependencies (jars, configs).
> > > > > >> > > > I would prefer not to bundle them into the base operator
> > > > > >> > > > image.
> > > > > >> > > >
> > > > > >> > > > 2. I think we should avoid creating the jobgraphs on the
> > > > > >> > > > operator side and use the jar upload/run REST API instead,
> > > > > >> > > > as you suggested. This will avoid Flink version and
> > > > > >> > > > dependency conflicts.
> > > > > >> > > >
> > > > > >> > > > Cheers,
> > > > > >> > > > Gyula
> > > > > >> > > >
> > > > > >> > > > On Wed, Mar 16, 2022 at 8:41 AM Aitozi <[email protected]> wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi Guys:
> > > > > >> > > > >
> > > > > >> > > > >     I would like to open a discussion on supporting
> > > > > >> > > > > session job management in the kubernetes operator. It is
> > > > > >> > > > > intended to enhance the flink-kubernetes-operator to
> > > > > >> > > > > manage session jobs with k8s tooling. I have drafted the
> > > > > >> > > > > design doc[1]. Please refer to it and give me some
> > > > > >> > > > > feedback.
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > [1]
> > > > > >> > > > > https://docs.google.com/document/d/1WPGbur1eT3H_5gN-kyXfp7EDjdbJUURx6jN8nt6UT-s/edit
> > > > > >> > > > >
> > > > > >> > > > > Best,
> > > > > >> > > > >
> > > > > >> > > > > Aitozi.
> > > > > >> > > > >
