It is worth looking at the controller code in the spotify operator too: https://github.com/spotify/flink-on-k8s-operator/blob/master/controllers/flinkcluster/flinkcluster_controller.go
It is looping in the 'observer phase' until it reaches a stable state, then it performs the necessary changes. Based on this I also suggest keeping the logic in separate modules(Validate->Observe->Reconcile). The three components might not even be enough as we add more and more complexity to the code. Cheers, Matyas On Mon, Feb 28, 2022 at 2:03 PM Aitozi <gjying1...@gmail.com> wrote: > Hi, Gyula > Thanks for driving this discussion. I second Yang Wang's idea that > it's better to make the `validator`, `observer` and `reconciler` > self-contained. I also prefer to define the `Observer` as an interface and > we could define the statuses that `Observer` will expose. It acts like the > observer protocol between the `Observer` and `Reconciler`. > > Best, > Aitozi. > > Yang Wang <danrtsey...@gmail.com> 于2022年2月28日周一 16:28写道: > > > Thanks for posting the discussion here. > > > > > > Having the components `validator` `observer` `reconciler` makes lots of > > sense. And the "Validate -> Observe -> Reconcile" > > flow seems natural to me. > > > > Regarding the implementation in the PR, instead of directly using the > > observer in the reconciler, I lean to let the observer > > exports the results to the status(e.g. jobmanager deployment status, rest > > port readiness, flink jobs status, etc.) and > > the reconciler reads it from the status. Then each component is more > > self-contained and the boundary will be clearer. > > > > > > Best, > > Yang > > > > Gyula Fóra <gyf...@apache.org> 于2022年2月28日周一 16:01写道: > > > > > Hi All! > > > > > > I would like to start a discussion thread regarding the structure of > > > the Kubernetes > > > Operator > > > < > > > > > > https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java > > > > > > > controller > > > flow. Based on some recent PR discussions we have no clear consensus on > > the > > > structure and the expectations which can potentially lead to back and > > forth > > > changes and unnecessary complexity. > > > > > > *Background* > > > In the initial prototype we had a very basic flow: > > > 1. Observe flink job status > > > 2. (if observation successful) reconcile changes > > > 3. Reschedule reconcile with success/error > > > > > > This basic prototype flow could not cover all requirements and did not > > > allow for things like waiting until Jobmanager deployment is ready etc. > > > > > > To solve these shortcomings, some changes were introduced recently here > > > <https://github.com/apache/flink-kubernetes-operator/pull/21>. While > > this > > > change introduced many improvements and safeguards it also completely > > > changed the original controller flow. Now the reconciler is responsible > > for > > > ensuring that it can actually reconcile by checking the deployment and > > > ports. The job status observation logic has also been moved into the > > actual > > > reconcile logic. > > > > > > > > > *Discussion Question*What controller flow would we like to have? Do we > > want > > > to separate the observer from the reconciler or keep them together? > > > > > > In my personal view, we should try to adopt a very simple flow to make > > the > > > operator clean and modular. If possible I would like to restore the > > > original flow with some modifications: > > > > > > 1. Validate deployment object > > > 2. Observe deployment and flink job status -> Return comprehensive > > status > > > info > > > 3. Reconcile deployment based on observed status and resource changes > > > (Both 2/3 should be able to reschedule immediately if necessary) > > > > > > I think the Observer component should be able to describe the current > > > status of the deployment objects and the flink job to the extent that > the > > > reconciler can work with that information alone. If we do it this way, > we > > > can also use the status information that the observer provides to > produce > > > other events and aid operations like shutdown which depend on the > current > > > deployment status. > > > > > > I think this would satisfy our needs, but I might be missing something > > that > > > cannot be done if we structure the code this way. > > > > > > I have a PR open > > > <https://github.com/apache/flink-kubernetes-operator/pull/26/commits> > > > which > > > includes some of these proposed changes (as the optional second commit) > > so > > > that you can easily compare with the current state of the operator. > > > > > > Please let us know what we think! > > > > > > Cheers, > > > Gyula > > > > > >