Hi, Rui Thanks for driving the FLIP.
The tuning of streaming jobs by autoscaler is very important. Although the mainstream trend now is cloud-native, many companies still run their Flink jobs on Yarn for historical reasons. If we can decouple autoscaler from K8S and turn it into a common tool that can support other resource management frameworks such as Yarn, I think it will be very meaningful. +1 for this proposal. Best, Ron Gyula Fóra <gyula.f...@gmail.com> 于2023年8月5日周六 15:03写道: > Hi Rui! > > Thanks for the proposal. > > I agree with Max on that the state store abstractions could be improved and > be more specific as we know what goes into the state. It could simply be > > public interface AutoScalerStateStore { > Map<String, String> getState(KEY jobKey) > void updateState(KEY jobKey, Map<String, String> state); > } > > > We could also remove the entire recommended parallelism logic from the > interface and make it internal to the implementation somehow because it's > not very nice in the current form. > > Cheers, > Gyula > > On Fri, Aug 4, 2023 at 7:05 AM Rui Fan <1996fan...@gmail.com> wrote: > > > Hi Max, > > > > After careful consideration, I prefer to keep the AutoScalerStateStore > > instead of AutoScalerEventHandler taking over the work of > > AutoScalerStateStore. And the following are some reasons: > > > > 1. Keeping the AutoScalerStateStore to make StateStore easy to plug in. > > > > Currently, the kubernetes-operator-autoscaler uses the ConfigMap as the > > state store. However, users may use a different state store for > > yarn-autoscaler or generic autoscaler. Such as: MySQL StateStore, > > Heaped StateStore or PostgreSQL StateStore, etc. > > > > Of course, kubernetes autoscaler can also use the MySQL StateStore. > > If the AutoScalerEventHandler is responsible for recording events, > > scaling job and accessing state, whenever users or community want to > > create a new state store, they must also implement the new > > AutoScalerEventHandler, it includes recording events and scaling job. > > > > If we decouple AutoScalerEventHandler and AutoScalerStateStore, > > it's easy to implement a new state store. > > > > 2. AutoScalerEventHandler isn't suitable for access state. > > > > Sometimes the generic autoscaler needs to query state, > > AutoScalerEventHandler can update the state during handling events. > > However, it's wired to provide a series of interfaces to query state. > > > > What do you think? > > > > And looking forward to more thoughts from the community, thanks! > > > > Best, > > Rui Fan > > > > On Tue, Aug 1, 2023 at 11:47 PM Rui Fan <1996fan...@gmail.com> wrote: > > > > > Hi Max, > > > > > > Thanks for your quick response! > > > > > > > 1. Handle state in the AutoScalerEventHandler which will receive > > > > all related scaling and metric collection events, and can keep > > > > track of any state. > > > > > > If I understand correctly, you mean that updating state is just part of > > > handling events, right? > > > > > > If yes, sounds make sense. However, I have some concerns: > > > > > > - Currently, we have 3 key-values that need to be stored. And the > > > autoscaler needs to get them first, then update them, and sometimes > > > remove them. If we use AutoScalerEventHandler, we need to provided > > > 9 methods, right? Every key has 3 methods. > > > - Do we add the persistState interface for AutoScalerEventHandler to > > > persist in-memory state to kubernetes? > > > > > > > 2. In the long run, the autoscaling logic can move to a > > > > separate repository, although this will complicate the release > > > > process, so I would defer this unless there is strong interest. > > > > > > I also agree to leave it in flink-k8s-operator for now. Unless moving > it > > > out of flink-k8s-operator is necessary in the future. > > > > > > > 3. The proposal mentions some removal of tests. > > > > > > Sorry, I didn't express clearly in FLIP. POC just check whether these > > > interfaces work well. It will take more time if I develop all the tests > > > during POC. So I removed these tests in my POC. > > > > > > These tests will be completed in the final PR, and the test is very > > useful > > > for less bugs. > > > > > > Best, > > > Rui Fan > > > > > > On Tue, Aug 1, 2023 at 10:10 PM Maximilian Michels <m...@apache.org> > > wrote: > > > > > >> Hi Rui, > > >> > > >> Thanks for the proposal. I think it makes a lot of sense to decouple > > >> the autoscaler from Kubernetes-related dependencies. A couple of notes > > >> when I read the proposal: > > >> > > >> 1. You propose AutoScalerEventHandler, AutoScalerStateStore, > > >> AutoScalerStateStoreFactory, and AutoScalerEventHandler. > > >> AutoscalerStateStore is a generic key/value database (methods: > > >> "get"/"put"/"delete"). I would propose to refine this interface and > > >> make it less general purpose, e.g. add a method for persisting scaling > > >> decisions as well as any metrics gathered for the current metric > > >> window. For simplicity, I'd even go so far to remove the state store > > >> entirely, but rather handle state in the AutoScalerEventHandler which > > >> will receive all related scaling and metric collection events, and can > > >> keep track of any state. > > >> > > >> 2. You propose to make the current autoscaler module > > >> Kubernetes-agnostic by moving the Kubernetes parts into the main > > >> operator module. I think that makes sense since the Kubernetes > > >> implementation will continue to be tightly coupled with Kubernetes. > > >> The goal of the separate module was to make the autoscaler logic > > >> pluggable, but this will continue to be possible with the new > > >> "flink-autoscaler" module which contains the autoscaling logic and > > >> interfaces. In the long run, the autoscaling logic can move to a > > >> separate repository, although this will complicate the release > > >> process, so I would defer this unless there is strong interest. > > >> > > >> 3. The proposal mentions some removal of tests. It is critical for us > > >> that all test coverage of the current implementation remains active. > > >> It is ok if some of the test coverage only covers the Kubernetes > > >> implementation. We can eventually move more tests without Kubernetes > > >> significance into the implementation-agnostic autoscaler tests. > > >> > > >> -Max > > >> > > >> On Tue, Aug 1, 2023 at 9:46 AM Rui Fan <fan...@apache.org> wrote: > > >> > > > >> > Hi all, > > >> > > > >> > I and Samrat(cc'ed) created the FLIP-334[1] to decoupling the > > autoscaler > > >> > and kubernetes. > > >> > > > >> > Currently, the flink-autoscaler is tightly integrated with > Kubernetes. > > >> > There are compelling reasons to extend the use of flink-autoscaler > to > > >> > more types of Flink jobs: > > >> > 1. With the recent merge of the Externalized Declarative Resource > > >> > Management (FLIP-291[2]), in-place scaling is now supported > > >> > across all types of Flink jobs. This development has made scaling > > Flink > > >> on > > >> > YARN a straightforward process. > > >> > 2. Several discussions[3] within the Flink user community, as > observed > > >> in > > >> > the mail list , have emphasized the necessity of flink-autoscaler > > >> > supporting > > >> > Flink on YARN. > > >> > > > >> > Please refer to the FLIP[1] document for more details about the > > proposed > > >> > design and implementation. We welcome any feedback and opinions on > > >> > this proposal. > > >> > > > >> > [1] https://cwiki.apache.org/confluence/x/x4qzDw > > >> > [2] > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > > >> > [3] > https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v > > >> > > > > > >