Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Xintong Song Sun, 06 Feb 2022 18:22:01 -0800

Sorry for the late reply. We were out due to the public holidays in China.

@Thomas,


> The intention is to support application management through operator and CR,
> which means there won't be any 2 step submission process, which as you
> allude to would defeat the purpose of this project. The CR example shows
> the application part. Please note that the bare cluster support is an
> *additional* feature for scenarios that require external job management. Is
> there anything on the FLIP page that creates a different impression?
>

Sounds good to me. I don't remember what created the impression of 2 step
submission back then. I revisited the latest version of this FLIP and it
looks good to me.

@Gyula,

> Versioning:
> Versioning will be independent from Flink and the operator will depend on a
> fixed flink version (in every given operator version).
> This should be the exact same setup as with Stateful Functions (
> https://github.com/apache/flink-statefun). So independent release cycle
> but
> still within the Flink umbrella.
>

Does this mean that if someone wants to upgrade Flink to a version released
after the operator version in use, they would need to upgrade the operator
first?
I'm not questioning this, just trying to make sure I'm understanding this
correctly.
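
To make the question concrete, here is a minimal sketch of my reading. The
mapping, the function names and the version numbers are made-up placeholders
for illustration, not anything defined in the FLIP, and the reading itself is
exactly what I'm asking to have confirmed.

```python
# Hypothetical mapping: operator release -> the single Flink version it is
# built against. All version numbers here are illustrative placeholders.
OPERATOR_FLINK_VERSION = {
    "0.1.0": (1, 14),
    "0.2.0": (1, 15),
}


def parse_flink_version(version):
    """Turn 'major.minor[.patch]' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_operator_upgrade(operator_version, target_flink_version):
    """Return True if the target Flink version is newer than the Flink
    version the given operator release is pinned to (assumed semantics)."""
    pinned = OPERATOR_FLINK_VERSION[operator_version]
    return parse_flink_version(target_flink_version) > pinned
```

Under this reading, `needs_operator_upgrade("0.1.0", "1.15.0")` would be true,
while a patch upgrade within the pinned minor version would not require a new
operator.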

Thank you~

Xintong Song



On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <[email protected]> wrote:

> Thank you Alexis,
>
> Will definitely check this out. You are right, Kotlin makes it difficult to
> adopt pieces of this code directly but I think it will be good to get
> inspiration for the architecture and look at how particular problems have
> been solved. It will be a great help for us I am sure.
>
> Cheers,
> Gyula
>
> On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> [email protected]> wrote:
>
> > Hi everyone,
> >
> > just wanted to mention that my employer agreed to open source the PoC I
> > developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
> >
> > I understand the concern for maintainability, so Gradle & Kotlin might
> not
> > be appealing to you, but at least it gives you another reference. The
> Helm
> > resources in particular might be useful.
> >
> > There are bits and pieces there referring to Flink sessions, but those
> are
> > just placeholders, the functioning parts use application mode with native
> > integration.
> >
> > Regards,
> > Alexis.
> >
> > ________________________________
> > From: Thomas Weise <[email protected]>
> > Sent: Saturday, February 5, 2022 2:41 AM
> > To: dev <[email protected]>
> > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
> >
> > Hi,
> >
> > Thanks for the continued feedback and discussion. Looks like we are
> > ready to start a VOTE, I will initiate it shortly.
> >
> > In parallel it would be good to find the repository name.
> >
> > My suggestion would be: flink-kubernetes-operator
> >
> > I thought "flink-operator" could be a bit misleading since the term
> > operator already has a meaning in Flink.
> >
> > I also considered "flink-k8s-operator" but that would be almost
> > identical to existing operator implementations and could lead to
> > confusion in the future.
> >
> > Thoughts?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <[email protected]> wrote:
> > >
> > > Hi Danny,
> > >
> > > So far we have been focusing our dev efforts on the initial native
> > > implementation with the team.
> > > > If the discussion and vote go well for this FLIP we are looking forward
> > > > to contributing the initial version sometime next week (fingers crossed).
> > >
> > > At that point I think we can already start the dev work to support the
> > > standalone mode as well, especially if you can dedicate some effort to
> > > pushing that side.
> > > Working together on this sounds like a great idea and we should start
> as
> > > soon as possible! :)
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <[email protected]>
> > > wrote:
> > >
> > > > I have been discussing this one with my team. We are interested in
> the
> > > > Standalone mode, and are willing to contribute towards the
> > implementation.
> > > > Potentially we can work together to support both modes in parallel?
> > > >
> > > > Thanks,
> > > >
> > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <[email protected]>
> > wrote:
> > > >
> > > > > Hi Danny!
> > > > >
> > > > > Thanks for the feedback :)
> > > > >
> > > > > Versioning:
> > > > > Versioning will be independent from Flink and the operator will
> > depend
> > > > on a
> > > > > fixed flink version (in every given operator version).
> > > > > This should be the exact same setup as with Stateful Functions (
> > > > > https://github.com/apache/flink-statefun). So independent release
> > cycle
> > > > > but
> > > > > still within the Flink umbrella.
> > > > >
> > > > > Deployment error handling:
> > > > > I think that's a very good point, as general exception handling for
> > the
> > > > > different failure scenarios is a tricky problem. I think the
> > exception
> > > > > classifiers and retry strategies could avoid a lot of manual
> > intervention
> > > > > from the user. We will definitely need to add something like this.
> > Once
> > > > we
> > > > > have the repo created with the initial operator code we should open
> > some
> > > > > tickets for this and put it on the short term roadmap!
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hey team,
> > > > > >
> > > > > > Great work on the FLIP, I am looking forward to this one. I agree
> > that
> > > > we
> > > > > > can move forward to the voting stage.
> > > > > >
> > > > > > I have general feedback around how we will handle job submission
> > > > failure
> > > > > > and retry. As discussed in the Rejected Alternatives section, we
> > can
> > > > use
> > > > > > Java to handle job submission failures from the Flink client. It
> > would
> > > > be
> > > > > > useful to have the ability to configure exception classifiers and
> > retry
> > > > > > strategy as part of operator configuration.
> > > > > >
> > > > > > > Given this will be in a separate GitHub repository I am curious how the
> > > > > > > versioning strategy will work in relation to the Flink version?
> Do
> > we
> > > > > have
> > > > > > any other components with a similar setup I can look at? Will the
> > > > > operator
> > > > > > version track Flink or will it use its own versioning strategy
> > with a
> > > > > Flink
> > > > > > version support matrix, or similar?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi team,
> > > > > > >
> > > > > > > Thank you for the great feedback, Thomas has updated the FLIP
> > page
> > > > > > > accordingly. If you are comfortable with the currently existing
> > > > design
> > > > > > and
> > > > > > > depth in the FLIP [1] I suggest moving forward to the voting
> > stage -
> > > > > once
> > > > > > > that reaches a positive conclusion it lets us create the
> separate
> > > > code
> > > > > > > repository under the flink project for the operator.
> > > > > > >
> > > > > > > I encourage everyone to keep improving the details in the
> > meantime,
> > > > > > however
> > > > > > > I believe given the existing design and the general sentiment
> on
> > this
> > > > > > > thread that the most efficient path from here is starting the
> > > > > > > implementation so that we can collectively iterate over it.
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > >
> > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <[email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > > > Hi Xintong,
> > > > > > > >
> > > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > > >
> > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > [email protected]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > > > discussion.
> > > > > > > > >
> > > > > > > > > I also have a few questions and comments.
> > > > > > > > >
> > > > > > > > > ## Job Submission
> > > > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > > > submitting
> > > > > > > > jobs
> > > > > > > > > to the cluster via Flink cli / REST is probably the
> approach
> > that
> > > > > > > > requires
> > > > > > > > > the least effort. However, I'd like to point out 2
> > weaknesses.
> > > > > > > > > 1. A lot of users use Flink in perjob/application modes.
> For
> > > > these
> > > > > > > users,
> > > > > > > > > having to run the job in two steps (deploy the cluster, and
> > > > submit
> > > > > > the
> > > > > > > > job)
> > > > > > > > > is not that convenient.
> > > > > > > > > > 2. One of our motivations is being able to manage Flink applications'
> > > > > > > > > > lifecycles with kubectl. Submitting jobs from the CLI does not seem
> > > > > > > > > > aligned with this motivation.
> > > > > > > > > I think it's probably worth it to support submitting jobs
> via
> > > > > > kubectl &
> > > > > > > > CR
> > > > > > > > > in the first version, both together with deploying the
> > cluster
> > > > like
> > > > > > in
> > > > > > > > > perjob/application mode and after deploying the cluster
> like
> > in
> > > > > > session
> > > > > > > > > mode.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The intention is to support application management through
> > operator
> > > > > and
> > > > > > > CR,
> > > > > > > > which means there won't be any 2 step submission process,
> > which as
> > > > > you
> > > > > > > > allude to would defeat the purpose of this project. The CR
> > example
> > > > > > shows
> > > > > > > > the application part. Please note that the bare cluster
> > support is
> > > > an
> > > > > > > > *additional* feature for scenarios that require external job
> > > > > > management.
> > > > > > > Is
> > > > > > > > there anything on the FLIP page that creates a different
> > > > impression?
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > ## Versioning
> > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > 3. Pod template support was introduced in Flink 1.13
> > > > > > > > > > 4. There were some changes to the Flink Docker image entrypoint
> > > > > > > > > > script in, IIRC, Flink 1.13
> > > > > > > > >
> > > > > > > >
> > > > > > > > Great, thanks for providing this. It is important for the
> > > > > compatibility
> > > > > > > > going forward also. We are targeting Flink 1.14.x upwards.
> > Before
> > > > the
> > > > > > > > operator is ready there will be another Flink release. Let's
> > see if
> > > > > > > anyone
> > > > > > > > is interested in earlier versions?
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > ## Compatibility
> > > > > > > > > > What kind of API compatibility can we commit to? It's probably fine
> > > > > > > > > > to have alpha / beta version APIs that allow incompatible future
> > > > > > > > > > changes for the first version. But eventually we would need to
> > > > > > > > > > guarantee backwards compatibility, so that an early version CR can
> > > > > > > > > > work with a new version operator.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Another great point and please let me include that on the
> FLIP
> > > > page.
> > > > > > ;-)
> > > > > > > >
> > > > > > > > I think we should allow incompatible changes for the first
> one
> > or
> > > > two
> > > > > > > > versions, similar to how other major features have evolved
> > > > recently,
> > > > > > such
> > > > > > > > as FLIP-27.
> > > > > > > >
> > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> [email protected]
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback!
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > > > agreed
> > > > > to
> > > > > > > do
> > > > > > > > > the
> > > > > > > > > > > first version of the operator based on the native
> > > > integration.
> > > > > > > > > > > While this clearly does not cover all use-cases and
> > > > > requirements,
> > > > > > > it
> > > > > > > > > > seems
> > > > > > > > > > > this would lead to a much smaller initial effort and a
> > nicer
> > > > > > first
> > > > > > > > > > version.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm also leaning towards the native integration, as long
> > as it
> > > > > > > reduces
> > > > > > > > > the
> > > > > > > > > > MVP effort. Ultimately the operator will need to also
> > support
> > > > the
> > > > > > > > > > standalone mode. I would like to gain more confidence
> that
> > > > native
> > > > > > > > > > integration reduces the effort. While it cuts the effort
> to
> > > > > handle
> > > > > > > the
> > > > > > > > TM
> > > > > > > > > > pod creation, some mapping code from the CR to the native
> > > > > > integration
> > > > > > > > > > client and config needs to be created. As mentioned in
> the
> > > > FLIP,
> > > > > > > native
> > > > > > > > > > integration requires the Flink job manager to have access
> > to
> > > > the
> > > > > > k8s
> > > > > > > > API
> > > > > > > > > to
> > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > unfavorable.
> > > > > > > > > >
> > > > > > > > > > > > > > # Pod Template
> > > > > > > > > > > > > > Is the pod template in the CR the same as what Flink already
> > > > > > > > > > > > > > supports[4]? If so, I am afraid that not every arbitrary field
> > > > > > > > > > > > > > (e.g. cpu/memory resources) would take effect.
> > > > > > > > > >
> > > > > > > > > > Yes, pod template would look almost identical. There are
> a
> > few
> > > > > > > settings
> > > > > > > > > > that the operator will control (and that may need to be
> > > > > > blacklisted),
> > > > > > > > but
> > > > > > > > > > in general we would not want to place restrictions. I
> > think a
> > > > > > > mechanism
> > > > > > > > > > where a pod template is merged from multiple layers would
> > also
> > > > be
> > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
>
