Re: [DISCUSS] Flink Kubernetes Operator controller flow

Őrhidi Mátyás Mon, 28 Feb 2022 05:26:42 -0800

It is worth looking at the controller code in the spotify operator too:
https://github.com/spotify/flink-on-k8s-operator/blob/master/controllers/flinkcluster/flinkcluster_controller.go


It is looping in the 'observer phase' until it reaches a stable state, then
it performs the necessary changes.

Based on this I also suggest keeping the logic in separate
modules(Validate->Observe->Reconcile). The three components might not
even be enough as we add more and more complexity to the code.

Cheers,
Matyas


On Mon, Feb 28, 2022 at 2:03 PM Aitozi <gjying1...@gmail.com> wrote:

> Hi, Gyula
>       Thanks for driving this discussion. I second Yang Wang's idea that
> it's better to make the `validator`, `observer` and `reconciler`
> self-contained. I also prefer to define the `Observer` as an interface and
> we could define the statuses that `Observer` will expose. It acts like the
> observer protocol between the `Observer` and `Reconciler`.
>
> Best,
> Aitozi.
>
> Yang Wang <danrtsey...@gmail.com> 于2022年2月28日周一 16:28写道：
>
> > Thanks for posting the discussion here.
> >
> >
> > Having the components `validator` `observer` `reconciler` makes lots of
> > sense. And the "Validate -> Observe -> Reconcile"
> > flow seems natural to me.
> >
> > Regarding the implementation in the PR, instead of directly using the
> > observer in the reconciler, I lean to let the observer
> > exports the results to the status(e.g. jobmanager deployment status, rest
> > port readiness, flink jobs status, etc.) and
> > the reconciler reads it from the status. Then each component is more
> > self-contained and the boundary will be clearer.
> >
> >
> > Best,
> > Yang
> >
> > Gyula Fóra <gyf...@apache.org> 于2022年2月28日周一 16:01写道：
> >
> > > Hi All!
> > >
> > > I would like to start a discussion thread regarding the structure of
> > > the Kubernetes
> > > Operator
> > > <
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
> > > >
> > > controller
> > > flow. Based on some recent PR discussions we have no clear consensus on
> > the
> > > structure and the expectations which can potentially lead to back and
> > forth
> > > changes and unnecessary complexity.
> > >
> > > *Background*
> > > In the initial prototype we had a very basic flow:
> > >  1. Observe flink job status
> > >  2. (if observation successful) reconcile changes
> > >  3. Reschedule reconcile with success/error
> > >
> > > This basic prototype flow could not cover all requirements and did not
> > > allow for things like waiting until Jobmanager deployment is ready etc.
> > >
> > > To solve these shortcomings, some changes were introduced recently here
> > > <https://github.com/apache/flink-kubernetes-operator/pull/21>. While
> > this
> > > change introduced many improvements and safeguards it also completely
> > > changed the original controller flow. Now the reconciler is responsible
> > for
> > > ensuring that it can actually reconcile by checking the deployment and
> > > ports. The job status observation logic has also been moved into the
> > actual
> > > reconcile logic.
> > >
> > >
> > > *Discussion Question*What controller flow would we like to have? Do we
> > want
> > > to separate the observer from the reconciler or keep them together?
> > >
> > > In my personal view, we should try to adopt a very simple flow to make
> > the
> > > operator clean and modular. If possible I would like to restore the
> > > original flow with some modifications:
> > >
> > >  1. Validate deployment object
> > >  2. Observe deployment and flink job status -> Return comprehensive
> > status
> > > info
> > >  3. Reconcile deployment based on observed status and resource changes
> > >  (Both 2/3 should be able to reschedule immediately if necessary)
> > >
> > > I think the Observer component should be able to describe the current
> > > status of the deployment objects and the flink job to the extent that
> the
> > > reconciler can work with that information alone. If we do it this way,
> we
> > > can also use the status information that the observer provides to
> produce
> > > other events and aid operations like shutdown which depend on the
> current
> > > deployment status.
> > >
> > > I think this would satisfy our needs, but I might be missing something
> > that
> > > cannot be done if we structure the code this way.
> > >
> > > I have a PR open
> > > <https://github.com/apache/flink-kubernetes-operator/pull/26/commits>
> > > which
> > > includes some of these proposed changes (as the optional second commit)
> > so
> > > that you can easily compare with the current state of the operator.
> > >
> > > Please let us know what we think!
> > >
> > > Cheers,
> > > Gyula
> > >
> >
>

Re: [DISCUSS] Flink Kubernetes Operator controller flow

Reply via email to