For the ScalingRealizer, I think having before/after hooks for `realizeParallelismOverrides` and `realizeConfigOverrides` would be good. We could support these hooks via plugins. Thoughts?
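A minimal sketch of how such before/after hooks might look, written as a decorator around an unchanged realizer. Everything here is an assumption for illustration: `ScalingRealizerHook`, `HookedScalingRealizer`, and the `demo` method are hypothetical names, and the `ScalingRealizer` interface below is a simplified stand-in (plain maps, a generic context) rather than the operator's actual signatures. Passing hooks a read-only view of the overrides is also just one possible design choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ScalingRealizerHooksSketch {

    /** Simplified stand-in for the autoscaler's ScalingRealizer; the real signatures differ. */
    interface ScalingRealizer<C> {
        void realizeParallelismOverrides(C context, Map<String, String> overrides);
        void realizeConfigOverrides(C context, Map<String, String> overrides);
    }

    /** Hypothetical plugin interface; each hook defaults to a no-op. */
    interface ScalingRealizerHook<C> {
        default void beforeRealizeParallelismOverrides(C context, Map<String, String> overrides) {}
        default void afterRealizeParallelismOverrides(C context, Map<String, String> overrides) {}
        default void beforeRealizeConfigOverrides(C context, Map<String, String> overrides) {}
        default void afterRealizeConfigOverrides(C context, Map<String, String> overrides) {}
    }

    /** Hypothetical decorator: runs hooks around a delegate without replacing its logic. */
    static final class HookedScalingRealizer<C> implements ScalingRealizer<C> {
        private final ScalingRealizer<C> delegate;
        private final List<ScalingRealizerHook<C>> hooks;

        HookedScalingRealizer(ScalingRealizer<C> delegate, List<ScalingRealizerHook<C>> hooks) {
            this.delegate = delegate;
            this.hooks = new ArrayList<>(hooks);
        }

        @Override
        public void realizeParallelismOverrides(C context, Map<String, String> overrides) {
            // Assumption: hooks only observe the overrides, so they get a read-only view.
            Map<String, String> view = Collections.unmodifiableMap(overrides);
            for (ScalingRealizerHook<C> hook : hooks) {
                hook.beforeRealizeParallelismOverrides(context, view);
            }
            delegate.realizeParallelismOverrides(context, overrides);
            for (ScalingRealizerHook<C> hook : hooks) {
                hook.afterRealizeParallelismOverrides(context, view);
            }
        }

        @Override
        public void realizeConfigOverrides(C context, Map<String, String> overrides) {
            Map<String, String> view = Collections.unmodifiableMap(overrides);
            for (ScalingRealizerHook<C> hook : hooks) {
                hook.beforeRealizeConfigOverrides(context, view);
            }
            delegate.realizeConfigOverrides(context, overrides);
            for (ScalingRealizerHook<C> hook : hooks) {
                hook.afterRealizeConfigOverrides(context, view);
            }
        }
    }

    /** Records the call order of a core realizer wrapped with one hook. */
    static List<String> demo() {
        final List<String> events = new ArrayList<>();
        ScalingRealizer<String> core = new ScalingRealizer<String>() {
            @Override
            public void realizeParallelismOverrides(String ctx, Map<String, String> o) {
                events.add("core:parallelism");
            }

            @Override
            public void realizeConfigOverrides(String ctx, Map<String, String> o) {
                events.add("core:config");
            }
        };
        // This hook overrides three of the four methods; the fourth stays a no-op.
        ScalingRealizerHook<String> hook = new ScalingRealizerHook<String>() {
            @Override
            public void beforeRealizeParallelismOverrides(String ctx, Map<String, String> o) {
                events.add("before:parallelism");
            }

            @Override
            public void afterRealizeParallelismOverrides(String ctx, Map<String, String> o) {
                events.add("after:parallelism");
            }

            @Override
            public void afterRealizeConfigOverrides(String ctx, Map<String, String> o) {
                events.add("after:config");
            }
        };
        ScalingRealizer<String> realizer =
                new HookedScalingRealizer<>(core, Collections.singletonList(hook));
        realizer.realizeParallelismOverrides("job-1", Collections.singletonMap("vertex-a", "4"));
        realizer.realizeConfigOverrides(
                "job-1", Collections.singletonMap("execution.checkpointing.interval", "5 min"));
        return events;
    }
}
```

In this sketch the core realizer stays untouched and a plugin only implements the hooks it cares about, which is closer to an extension point than to a replacement of the component.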

Best,
Diljeet (DJ) Singh

On 2025/08/26 08:24:33 Maximilian Michels wrote:
> Hi Peter,
>
> First of all, this is a great initiative. Flink Autoscaling definitely
> needs more points of extension. We recently added support for hooking
> into the metric evaluation (FLIP-514), but clearly that is just one
> extension point.
>
> That said, I think we will need to revise the approach a bit. I'm not
> sure we should be replacing core components. As Gyula mentioned,
> replacing those will easily break the entire autoscaler. Instead, we
> should be adding extension points which allow for meaningful additions
> without breaking the scaling logic. There is already the option to
> replace the entire autoscaling module, if users really want to roll
> out a completely custom version.
>
> What usually works best is to formulate the use case first, then
> figure out what autoscaler customization would be necessary to
> implement the use case.
>
> As for making the ScalingRealizer pluggable
> (https://github.com/apache/flink-kubernetes-operator/pull/1020/files),
> I do think that makes sense for some scenarios.
>
> Cheers,
> Max
>
> On Tue, Aug 26, 2025 at 8:59 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > Hi Peter & Diljeet!
> >
> > My general feedback is that we should try to introduce extension plugins
> > instead of plugins that completely replace key parts of the autoscaler code.
> >
> > Let me give you a concrete example through FLIP-514 and FLIP-543 using the
> > MetricsEvaluator pluggability.
> > The MetricsEvaluator in the autoscaler is responsible for
> > evaluating/deriving/calculating metrics from the collected metrics. It has
> > to calculate everything in a more or less specific way, otherwise other
> > parts of the autoscaler that depend on these metrics may not work. It
> > doesn't seem very practical/reasonable to completely reimplement this just
> > because someone wants to extend the logic; this is extremely error-prone
> > and fragile, especially if the autoscaler logic later evolves.
> >
> > FLIP-514 takes the approach of extending the metric evaluator with a new
> > method that allows users to modify the evaluated metrics at the end and
> > define custom ones. This is the right approach here as it makes a new
> > extension very simple to build and maintain without interfering with
> > existing logic.
> >
> > The approach in FLIP-543 and in Diljeet's example PR takes the replacement
> > approach, completely substituting entire parts of the implementation
> > (the entire evaluator, scaling realizer etc). I think this is not very good
> > for either the community or the actual user. From a community perspective
> > it makes it harder to extend the logic with nice small additions, and from a
> > user's perspective it is very error-prone if the operator autoscaler logic
> > changes, as it basically exposes a lot of internal logic on a user interface.
> >
> > So at this point, -1 for the approach in FLIP-543 from my side, but I
> > would love to hear the opinion of others as well.
> >
> > Cheers
> > Gyula
> >
> > On Mon, Aug 25, 2025 at 11:44 PM Peter Huang <hu...@gmail.com> wrote:
> >>
> >> Hi Diljeet,
> >>
> >> Yes, I think we have similar requirements to make the autoscaler even more
> >> powerful so that it can handle some customized requirements.
> >> The quick PoC makes sense to me. Let's get some more feedback from the
> >> community.
> >>
> >> Best Regards
> >> Peter Huang
> >>
> >> On Mon, Aug 25, 2025 at 2:37 PM Peter Huang <hu...@gmail.com>
> >> wrote:
> >>
> >> > Just trying to combine the discussions into one thread.
> >> >
> >> > @Diljeet Singh
> >> > Posted a quick PoC for the proposal
> >> > https://github.com/apache/flink-kubernetes-operator/pull/1020.
> >> >
> >> > On Mon, Aug 25, 2025 at 7:52 AM Peter Huang <hu...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Community,
> >> >>
> >> >> Our org has been heavily using the Flink autoscaling algorithm. It
> >> >> has greatly reduced our operational overhead and improved cost efficiency,
> >> >> as users always over-provision resources when onboarding. Recently, we have
> >> >> had some requirements to customize the autoscaling algorithm
> >> >> for different scenarios, for example, large but predictable traffic
> >> >> spikes during the holiday season, and increasing the checkpoint interval
> >> >> together with scaling up for streaming ingestion use cases.
> >> >>
> >> >> We searched through the discussions on this topic on the mailing list,
> >> >> including the existing FLIP-514
> >> >> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler>.
> >> >> It looks like the discussion is not finalized yet.
> >> >> To accelerate the process, we adopted and combined the
> >> >> existing opinions from the community and created a proposal in FLIP-543
> >> >> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm>.
> >> >> The basic idea
> >> >> is to make some core components of the autoscaler pluggable, for example,
> >> >> MetricsCollector, MetricsEvaluator, and ScalingRealizer, while at the same
> >> >> time keeping the core logic skeleton of the autoscaler (which is already
> >> >> well justified by a large number of users) untouched.
> >> >>
> >> >> Looking forward to any feedback and opinions on FLIP-543.
> >> >>
> >> >> [1]
> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm
> >> >> [2]
> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler
> >> >> [3] Other related discussion threads
> >> >>
> >> >> https://lists.apache.org/thread/749l74z1h5jylkxrw3rtjmxcj2t9p7ws
> >> >>
> >> >> https://lists.apache.org/thread/mcd7jcn4kz6oqtyqq5hfycjf9mqh6c53
> >> >>
> >> >> Best Regards
> >> >> Peter Huang