Re: [DISCUSS] Flink Kubernetes Operator controller flow

Biao Geng Mon, 28 Feb 2022 07:49:08 -0800

Hi Gyula,
Thanks for the discussion. It also makes senses to me on the separation of
3 components and Yang's proposal.
Just 1 follow-up thought after checking your PR: in the reconcile()
method of controller, IIUC, the real flow could be
`validate->observe->reconcile->validate->observe->reconcile...". The
validation phase seems to be required only when the creation of the job
cluster and the upgrade of config. For phases like waiting the JM from
deploying to ready, it is not mandatory and thus the flow can look like
`validate->observe->reconcile->optional validate due to current
state->observe->reconcile...`


Őrhidi Mátyás <[email protected]> 于2022年2月28日周一 21:26写道：

> It is worth looking at the controller code in the spotify operator too:
>
> https://github.com/spotify/flink-on-k8s-operator/blob/master/controllers/flinkcluster/flinkcluster_controller.go
>
> It is looping in the 'observer phase' until it reaches a stable state, then
> it performs the necessary changes.
>
> Based on this I also suggest keeping the logic in separate
> modules(Validate->Observe->Reconcile). The three components might not
> even be enough as we add more and more complexity to the code.
>
> Cheers,
> Matyas
>
>
> On Mon, Feb 28, 2022 at 2:03 PM Aitozi <[email protected]> wrote:
>
> > Hi, Gyula
> >       Thanks for driving this discussion. I second Yang Wang's idea that
> > it's better to make the `validator`, `observer` and `reconciler`
> > self-contained. I also prefer to define the `Observer` as an interface
> and
> > we could define the statuses that `Observer` will expose. It acts like
> the
> > observer protocol between the `Observer` and `Reconciler`.
> >
> > Best,
> > Aitozi.
> >
> > Yang Wang <[email protected]> 于2022年2月28日周一 16:28写道：
> >
> > > Thanks for posting the discussion here.
> > >
> > >
> > > Having the components `validator` `observer` `reconciler` makes lots of
> > > sense. And the "Validate -> Observe -> Reconcile"
> > > flow seems natural to me.
> > >
> > > Regarding the implementation in the PR, instead of directly using the
> > > observer in the reconciler, I lean to let the observer
> > > exports the results to the status(e.g. jobmanager deployment status,
> rest
> > > port readiness, flink jobs status, etc.) and
> > > the reconciler reads it from the status. Then each component is more
> > > self-contained and the boundary will be clearer.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Gyula Fóra <[email protected]> 于2022年2月28日周一 16:01写道：
> > >
> > > > Hi All!
> > > >
> > > > I would like to start a discussion thread regarding the structure of
> > > > the Kubernetes
> > > > Operator
> > > > <
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
> > > > >
> > > > controller
> > > > flow. Based on some recent PR discussions we have no clear consensus
> on
> > > the
> > > > structure and the expectations which can potentially lead to back and
> > > forth
> > > > changes and unnecessary complexity.
> > > >
> > > > *Background*
> > > > In the initial prototype we had a very basic flow:
> > > >  1. Observe flink job status
> > > >  2. (if observation successful) reconcile changes
> > > >  3. Reschedule reconcile with success/error
> > > >
> > > > This basic prototype flow could not cover all requirements and did
> not
> > > > allow for things like waiting until Jobmanager deployment is ready
> etc.
> > > >
> > > > To solve these shortcomings, some changes were introduced recently
> here
> > > > <https://github.com/apache/flink-kubernetes-operator/pull/21>. While
> > > this
> > > > change introduced many improvements and safeguards it also completely
> > > > changed the original controller flow. Now the reconciler is
> responsible
> > > for
> > > > ensuring that it can actually reconcile by checking the deployment
> and
> > > > ports. The job status observation logic has also been moved into the
> > > actual
> > > > reconcile logic.
> > > >
> > > >
> > > > *Discussion Question*What controller flow would we like to have? Do
> we
> > > want
> > > > to separate the observer from the reconciler or keep them together?
> > > >
> > > > In my personal view, we should try to adopt a very simple flow to
> make
> > > the
> > > > operator clean and modular. If possible I would like to restore the
> > > > original flow with some modifications:
> > > >
> > > >  1. Validate deployment object
> > > >  2. Observe deployment and flink job status -> Return comprehensive
> > > status
> > > > info
> > > >  3. Reconcile deployment based on observed status and resource
> changes
> > > >  (Both 2/3 should be able to reschedule immediately if necessary)
> > > >
> > > > I think the Observer component should be able to describe the current
> > > > status of the deployment objects and the flink job to the extent that
> > the
> > > > reconciler can work with that information alone. If we do it this
> way,
> > we
> > > > can also use the status information that the observer provides to
> > produce
> > > > other events and aid operations like shutdown which depend on the
> > current
> > > > deployment status.
> > > >
> > > > I think this would satisfy our needs, but I might be missing
> something
> > > that
> > > > cannot be done if we structure the code this way.
> > > >
> > > > I have a PR open
> > > > <https://github.com/apache/flink-kubernetes-operator/pull/26/commits
> >
> > > > which
> > > > includes some of these proposed changes (as the optional second
> commit)
> > > so
> > > > that you can easily compare with the current state of the operator.
> > > >
> > > > Please let us know what we think!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > >
> >
>

Re: [DISCUSS] Flink Kubernetes Operator controller flow

Reply via email to