Thanks Rui for the update! Alongside the refactoring to decouple the
autoscaler logic from the deployment logic, are we planning to add an
alternative implementation against the new interfaces? I think the best
way to get the interfaces right is to have an alternative implementation
in addition to Kubernetes. A YARN or standalone-mode implementation was
already mentioned. Ultimately, this is the reason we are doing the
refactoring. Without a new implementation, it becomes harder to justify
the refactoring work.

Cheers,
Max
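To make the alternative-implementation idea concrete, here is a minimal
sketch of what a YARN-backed realizer against the proposed interfaces
might look like. The ScalingRealizer shape follows the thread's
description ("the ScalingRealizer handles scaling actions"); the
YarnScalingRealizer class, the context stand-in, and all method names
are illustrative assumptions, not part of the FLIP.

    import java.util.Map;

    // Simplified stand-in for the FLIP's job context (see the
    // JobAutoScalerContext discussion later in this thread).
    interface JobAutoScalerContext<KEY> {
        KEY jobKey();
    }

    // Assumed shape: the ScalingRealizer applies scaling decisions.
    interface ScalingRealizer<KEY, CTX extends JobAutoScalerContext<KEY>> {
        void realize(CTX context, Map<String, String> parallelismOverrides);
    }

    // Hypothetical YARN implementation: instead of patching a Kubernetes
    // CR, it could apply the overrides through the job's REST API, using
    // the in-place rescaling introduced by FLIP-291.
    class YarnScalingRealizer<KEY>
            implements ScalingRealizer<KEY, JobAutoScalerContext<KEY>> {
        @Override
        public void realize(
                JobAutoScalerContext<KEY> context,
                Map<String, String> parallelismOverrides) {
            // Placeholder for the REST call against the running job.
            System.out.println(
                    "Rescaling " + context.jobKey() + " to " + parallelismOverrides);
        }
    }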
On Tue, Sep 5, 2023 at 9:48 AM Rui Fan <fan...@apache.org> wrote:
>
> After discussing this FLIP-334[1] offline with Gyula and Max,
> I updated the FLIP based on the latest conclusion.
>
> Big thanks to Gyula and Max for their professional advice!
>
> > Does the interface function of handlerRecommendedParallelism
> > in AutoScalerEventHandler conflict with
> > handlerScalingFailure/handlerScalingReport (one of them
> > handles the event of scale failure, and the other handles
> > the event of scale success)?
> Hi Matt,
>
> You can take a look at the FLIP; I think this issue has been fixed.
> Currently, we introduced the ScalingRealizer and
> AutoScalerEventHandler interfaces.
> The ScalingRealizer handles scaling actions.
> The AutoScalerEventHandler handles loggable events.
>
> Looking forward to your feedback, thanks!
>
> [1] https://cwiki.apache.org/confluence/x/x4qzDw
>
> Best,
> Rui
>
> On Thu, Aug 24, 2023 at 10:55 AM Matt Wang <wang...@163.com> wrote:
>>
>> Sorry for the late reply, I still have a small question here:
>> Does the interface function of handlerRecommendedParallelism
>> in AutoScalerEventHandler conflict with
>> handlerScalingFailure/handlerScalingReport (one of them
>> handles the event of scale failure, and the other handles
>> the event of scale success)?
>>
>> --
>>
>> Best,
>> Matt Wang
>>
>> ---- Replied Message ----
>> | From | Rui Fan<1996fan...@gmail.com> |
>> | Date | 08/21/2023 17:41 |
>> | To | <dev@flink.apache.org> |
>> | Cc | Maximilian Michels<m...@apache.org> ,
>> Gyula Fóra<gyula.f...@gmail.com> ,
>> Matt Wang<wang...@163.com> |
>> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes |
>> Hi Max, Gyula and Matt,
>>
>> Do you have any other comments?
>>
>> The flink-kubernetes-operator 1.6 has been released recently,
>> so it's a good time to kick off this FLIP.
>>
>> Please let me know if you have any questions or concerns;
>> looking forward to your feedback, thanks!
>>
>> Best,
>> Rui
>>
>> On Wed, Aug 9, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> Hi Matt Wang,
>>
>> Thanks for your discussion here.
>>
>> it is recommended to unify the descriptions of AutoScalerHandler
>> and AutoScalerEventHandler in the FLIP
>>
>> Good catch, I have updated all AutoScalerHandler to
>> AutoScalerEventHandler.
>>
>> Can it support the use of zookeeper (zookeeper is a relatively
>> common choice for flink HA)?
>>
>> In my opinion, it's a good suggestion. However, I prefer to
>> implement other state stores in a separate FLINK JIRA, and let
>> this FLIP focus on the decoupling and on implementing the
>> necessary state store. Does that make sense?
>>
>> Regarding the information of each scaling, can it be persisted
>> in a shared file system through the filesystem? I think it will
>> be a more valuable requirement to support viewing
>> Autoscaling info on the UI in the future, which can provide
>> some foundations in advance;
>>
>> This is a good suggestion as well. It's useful for users to check
>> the scaling information. I propose to add a CompositeEventHandler,
>> which can include multiple EventHandlers.
>>
>> However, as with the last question, I prefer to implement other
>> event handlers in a separate FLINK JIRA. What do you think?
>>
>> A solution mentioned in FLIP is to initialize the
>> AutoScalerEventHandler object every time an event is
>> processed.
>>
>> No, the FLIP mentions `The AutoScalerEventHandler object is shared for
>> all flink jobs`,
>> so the AutoScalerEventHandler is only initialized once.
>>
>> And we call the AutoScalerEventHandler#handlerXXX
>> every time an event is processed.
>>
>> Best,
>> Rui
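The CompositeEventHandler Rui proposes above is not spelled out in the
thread; a minimal sketch of such a fan-out handler follows. The
AutoScalerEventHandler shape and the Type enum are assumptions based on
this discussion, not the FLIP's final API.

    import java.util.List;

    // Assumed handler shape: one callback per loggable event.
    interface AutoScalerEventHandler<KEY> {
        enum Type { Normal, Warning }
        void handleEvent(KEY jobKey, Type type, String reason, String message);
    }

    // Hypothetical composite: fans each event out to several backends,
    // e.g. Kubernetes events plus a filesystem log for a future UI.
    class CompositeEventHandler<KEY> implements AutoScalerEventHandler<KEY> {
        private final List<AutoScalerEventHandler<KEY>> handlers;

        CompositeEventHandler(List<AutoScalerEventHandler<KEY>> handlers) {
            this.handlers = handlers;
        }

        @Override
        public void handleEvent(KEY jobKey, Type type, String reason, String message) {
            // Deliver every loggable event to each registered handler.
            for (AutoScalerEventHandler<KEY> h : handlers) {
                h.handleEvent(jobKey, type, reason, message);
            }
        }
    }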
>>
>> On Tue, Aug 8, 2023 at 9:40 PM Matt Wang <wang...@163.com> wrote:
>>
>> Hi Rui,
>>
>> Thanks for driving the FLIP.
>>
>> I agree with the point of this FLIP. This FLIP first provides the
>> general Autoscaler function in the Flink repo, and there is no
>> need to move the kubernetes-autoscaler from the kubernetes-operator
>> to the Flink repo in this FLIP (it is recommended to unify the
>> descriptions of AutoScalerHandler and AutoScalerEventHandler
>> in the FLIP). Here I still have a few questions:
>>
>> 1. AutoScalerStateStore mainly records the state information
>> during scaling. In addition to supporting the use of configmap,
>> can it support the use of zookeeper (zookeeper is a relatively
>> common choice for flink HA)?
>> 2. Regarding the information of each scaling, can it be persisted
>> in a shared file system through the filesystem? I think it will
>> be a more valuable requirement to support viewing
>> Autoscaling info on the UI in the future, which can provide
>> some foundations in advance;
>> 3. A solution mentioned in the FLIP is to initialize the
>> AutoScalerEventHandler object every time an event is
>> processed. What is the main purpose of this solution?
>>
>> --
>>
>> Best,
>> Matt Wang
>>
>> ---- Replied Message ----
>> | From | Rui Fan<1996fan...@gmail.com> |
>> | Date | 08/7/2023 11:34 |
>> | To | <dev@flink.apache.org> |
>> | Cc | m...@apache.org<m...@apache.org> ,
>> Gyula Fóra<gyula.f...@gmail.com> |
>> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes |
>> Hi Ron:
>>
>> Thanks for the feedback! The goal is indeed to turn the autoscaler
>> into a general tool that can support other resource management
>> frameworks.
>>
>> Hi Max, Gyula:
>>
>> My proposed `AutoScalerStateStore` is similar to a Map, and it can
>> indeed be improved:
>>
>> public interface AutoScalerStateStore<KEY> {
>>     Map<String, String> getState(KEY jobKey);
>>     void updateState(KEY jobKey, Map<String, String> state);
>> }
>>
>> Judging from the method parameters, the StateStore is shared by all
>> jobs, right? If yes, the `KEY jobKey` isn't enough, because the CR is
>> needed when creating the ConfigMap. The `jobKey` is the ResourceID,
>> and the extraJobInfo is the CR.
>>
>> So, this parameter may need to be changed from `KEY jobKey` to
>> `JobAutoScalerContext<KEY, INFO>`. Does that make sense?
>> If yes, I can update the interface in the FLIP doc.
>>
>> Best,
>> Rui
>>
>> On Mon, Aug 7, 2023 at 10:18 AM liu ron <ron9....@gmail.com> wrote:
>>
>> Hi Rui,
>>
>> Thanks for driving the FLIP.
>>
>> The tuning of streaming jobs by the autoscaler is very important.
>> Although the mainstream trend now is cloud-native, many companies
>> still run their Flink jobs on Yarn for historical reasons. If we can
>> decouple the autoscaler from K8s and turn it into a common tool that
>> can support other resource management frameworks such as Yarn, I
>> think it will be very meaningful.
>> +1 for this proposal.
>>
>> Best,
>> Ron
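A sketch of the signature change Rui suggests above, passing a context
object instead of a bare key, might look like this. The
JobAutoScalerContext accessors are assumptions based on the thread
(jobKey as the ResourceID, extraJobInfo as the CR), not the FLIP's
final definition.

    import java.util.Map;

    // Assumed context shape: the key plus deployment-specific extra
    // info (on Kubernetes, the CR needed when creating the ConfigMap).
    interface JobAutoScalerContext<KEY, INFO> {
        KEY jobKey();
        INFO extraJobInfo();
    }

    // The state store taking the full context instead of `KEY jobKey`,
    // so a Kubernetes implementation can reach the CR.
    interface AutoScalerStateStore<KEY, INFO> {
        Map<String, String> getState(JobAutoScalerContext<KEY, INFO> context);
        void updateState(
                JobAutoScalerContext<KEY, INFO> context, Map<String, String> state);
    }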
>>
>> Gyula Fóra <gyula.f...@gmail.com> wrote on Sat, Aug 5, 2023 at 15:03:
>>
>> Hi Rui!
>>
>> Thanks for the proposal.
>>
>> I agree with Max that the state store abstraction could be improved
>> and made more specific, as we know what goes into the state.
>> It could simply be:
>>
>> public interface AutoScalerStateStore<KEY> {
>>     Map<String, String> getState(KEY jobKey);
>>     void updateState(KEY jobKey, Map<String, String> state);
>> }
>>
>> We could also remove the entire recommended-parallelism logic from
>> the interface and make it internal to the implementation somehow,
>> because it's not very nice in its current form.
>>
>> Cheers,
>> Gyula
>>
>> On Fri, Aug 4, 2023 at 7:05 AM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> Hi Max,
>>
>> After careful consideration, I prefer to keep the AutoScalerStateStore
>> instead of having the AutoScalerEventHandler take over its work, for
>> the following reasons:
>>
>> 1. Keeping the AutoScalerStateStore makes the StateStore easy to
>> plug in.
>>
>> Currently, the kubernetes-operator-autoscaler uses the ConfigMap as
>> the state store. However, users may want a different state store for
>> a yarn-autoscaler or a generic autoscaler, such as a MySQL StateStore,
>> a heap-based StateStore, or a PostgreSQL StateStore.
>>
>> Of course, the kubernetes autoscaler can also use the MySQL
>> StateStore. If the AutoScalerEventHandler is responsible for
>> recording events, scaling the job, and accessing state, then whenever
>> users or the community want to create a new state store, they must
>> also implement a new AutoScalerEventHandler, including the event
>> recording and job scaling parts.
>>
>> If we decouple AutoScalerEventHandler and AutoScalerStateStore,
>> it's easy to implement a new state store.
>>
>> 2. The AutoScalerEventHandler isn't suitable for accessing state.
>>
>> Sometimes the generic autoscaler needs to query state. The
>> AutoScalerEventHandler can update state while handling events, but
>> it's weird for it to provide a series of interfaces to query state.
>>
>> What do you think?
>>
>> Looking forward to more thoughts from the community, thanks!
>>
>> Best,
>> Rui Fan
>>
>> On Tue, Aug 1, 2023 at 11:47 PM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> Hi Max,
>>
>> Thanks for your quick response!
>>
>> 1. Handle state in the AutoScalerEventHandler which will receive
>> all related scaling and metric collection events, and can keep
>> track of any state.
>>
>> If I understand correctly, you mean that updating state is just part
>> of handling events, right?
>>
>> If yes, that sounds reasonable. However, I have some concerns:
>>
>> - Currently, we have 3 key-values that need to be stored, and the
>> autoscaler needs to get them first, then update them, and sometimes
>> remove them. If we use the AutoScalerEventHandler, we need to provide
>> 9 methods, right? Every key needs 3 methods.
>> - Do we add a persistState interface to the AutoScalerEventHandler
>> to persist in-memory state to kubernetes?
>>
>> 2. In the long run, the autoscaling logic can move to a
>> separate repository, although this will complicate the release
>> process, so I would defer this unless there is strong interest.
>>
>> I also agree to leave it in flink-k8s-operator for now, unless moving
>> it out of flink-k8s-operator becomes necessary in the future.
>>
>> 3. The proposal mentions some removal of tests.
>>
>> Sorry, I didn't express this clearly in the FLIP. The POC just checks
>> whether these interfaces work well. It would have taken more time to
>> develop all the tests during the POC, so I removed them there.
>>
>> These tests will be completed in the final PR; tests are very useful
>> for avoiding bugs.
>>
>> Best,
>> Rui Fan
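As a concrete illustration of point 1 in Rui's Aug 4 reply above
(pluggability), a heap-based store against the AutoScalerStateStore<KEY>
interface sketched by Gyula could be as small as the following. The
class is hypothetical, for illustration only.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical in-memory backend: once state access is decoupled
    // from the event handler, a new store only needs these two methods.
    class HeapedAutoScalerStateStore<KEY> implements AutoScalerStateStore<KEY> {
        private final Map<KEY, Map<String, String>> states = new ConcurrentHashMap<>();

        @Override
        public Map<String, String> getState(KEY jobKey) {
            // Return an empty state for unknown jobs.
            return states.getOrDefault(jobKey, new HashMap<>());
        }

        @Override
        public void updateState(KEY jobKey, Map<String, String> state) {
            // Copy defensively so callers cannot mutate stored state.
            states.put(jobKey, new HashMap<>(state));
        }
    }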
>>
>> On Tue, Aug 1, 2023 at 10:10 PM Maximilian Michels <m...@apache.org>
>> wrote:
>>
>> Hi Rui,
>>
>> Thanks for the proposal. I think it makes a lot of sense to decouple
>> the autoscaler from Kubernetes-related dependencies. A couple of
>> notes from when I read the proposal:
>>
>> 1. You propose AutoScalerEventHandler, AutoScalerStateStore, and
>> AutoScalerStateStoreFactory. AutoScalerStateStore is a generic
>> key/value database (methods: "get"/"put"/"delete"). I would propose
>> to refine this interface and make it less general purpose, e.g. add a
>> method for persisting scaling decisions as well as any metrics
>> gathered for the current metric window. For simplicity, I'd even go
>> so far as to remove the state store entirely and rather handle state
>> in the AutoScalerEventHandler, which will receive all related scaling
>> and metric collection events and can keep track of any state.
>>
>> 2. You propose to make the current autoscaler module
>> Kubernetes-agnostic by moving the Kubernetes parts into the main
>> operator module. I think that makes sense, since the Kubernetes
>> implementation will continue to be tightly coupled with Kubernetes.
>> The goal of the separate module was to make the autoscaler logic
>> pluggable, and this will continue to be possible with the new
>> "flink-autoscaler" module, which contains the autoscaling logic and
>> interfaces. In the long run, the autoscaling logic can move to a
>> separate repository, although this will complicate the release
>> process, so I would defer this unless there is strong interest.
>>
>> 3. The proposal mentions some removal of tests. It is critical for us
>> that all test coverage of the current implementation remains active.
>> It is ok if some of the test coverage only covers the Kubernetes
>> implementation. We can eventually move more tests without Kubernetes
>> significance into the implementation-agnostic autoscaler tests.
>>
>> -Max
>>
>> On Tue, Aug 1, 2023 at 9:46 AM Rui Fan <fan...@apache.org> wrote:
>>
>> Hi all,
>>
>> Samrat (cc'ed) and I created FLIP-334[1] to decouple the autoscaler
>> and kubernetes.
>>
>> Currently, the flink-autoscaler is tightly integrated with Kubernetes.
>> There are compelling reasons to extend the use of the flink-autoscaler
>> to more types of Flink jobs:
>> 1. With the recent merge of Externalized Declarative Resource
>> Management (FLIP-291[2]), in-place scaling is now supported across
>> all types of Flink jobs. This development has made scaling Flink on
>> YARN a straightforward process.
>> 2. Several discussions[3] within the Flink user community, as
>> observed on the mailing list, have emphasized the necessity of the
>> flink-autoscaler supporting Flink on YARN.
>>
>> Please refer to the FLIP[1] document for more details about the
>> proposed design and implementation. We welcome any feedback and
>> opinions on this proposal.
>>
>> [1] https://cwiki.apache.org/confluence/x/x4qzDw
>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> [3] https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v
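Max's point 1 above (a narrower, domain-specific store instead of a
generic key/value database) could look something like this sketch. All
method and type names are illustrative assumptions, not an API from the
FLIP.

    import java.time.Instant;
    import java.util.Map;
    import java.util.SortedMap;

    // Assumed refinement: one explicit method per kind of autoscaler
    // state, instead of generic get/put/delete.
    interface RefinedAutoScalerStateStore<KEY> {
        // Persist the latest scaling decision (vertex id -> parallelism).
        void storeScalingDecision(KEY jobKey, Map<String, Integer> vertexParallelism);

        // Persist the metrics gathered for the current metric window,
        // keyed by collection time.
        void storeMetricWindow(KEY jobKey, SortedMap<Instant, Map<String, Double>> metrics);

        Map<String, Integer> getScalingDecision(KEY jobKey);

        SortedMap<Instant, Map<String, Double>> getMetricWindow(KEY jobKey);
    }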