Hey! I agree that the use-cases outlined here are probably common for large autoscaler users, and we should aim for nice, well-integrated solutions in the core algorithm.
Regarding the predictive scaling example: for FLIP-514 we actually had something similar in mind. We recognize that predicting future load increases is largely use-case / company dependent, so the idea behind the custom evaluator plugin was to allow, for example, custom prediction logic to be added that way to modify the scaling targets.

To zoom in a little more on this particular use-case: for a predictive approach we probably don't need pluggable everything. The custom evaluator of FLIP-514 would allow us to add an evaluator implementation that can reach out to external systems and, based on that information, for example double ScalingMetric.TARGET_DATA_RATE. Since the target data rate is what controls the scaling target, no further modifications are needed. We intentionally avoided introducing new collected metrics like PREDICTED_DATA_RATE etc., as they do not seem to make the situation easier; they just complicate the entire logic while arriving at the same result.

Data size aware scaling sounds more like dynamic parallelism bounds for certain operators. If I understand correctly, we want to control Iceberg sink parallelisms so that output file sizes remain in a fixed range. If this is a fair assessment, then maybe a single extra step in the scaling realizer could solve it. Again, it is probably not necessary to add anything to the metrics collector if we can observe the file sizes directly in a single step.

In any case, it would be great to talk about solving these specific issues independently in the "best possible way"; then I think we will come to a good solution regarding pluggability that will probably serve future cases well too.

Cheers,
Gyula

On Tue, Aug 26, 2025 at 6:23 PM Peter Huang <huangzhenqiu0...@gmail.com> wrote:

> Hi Gyula and Max,
>
> Thanks for the feedback. I totally agree with your concern about the integrity of the autoscaling algorithms.
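To make the custom-evaluator idea described above concrete, here is a minimal, self-contained sketch of an evaluator step that asks an external system for a predicted spike and scales only the target data rate. The interface and class names here are illustrative assumptions for this thread, not the actual FLIP-514 API:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: a post-evaluation hook that multiplies the evaluated
 * TARGET_DATA_RATE when an external system predicts a traffic spike.
 * Everything downstream (parallelism computation) stays untouched.
 */
public class PredictiveEvaluatorSketch {

    // Stand-in for ScalingMetric.TARGET_DATA_RATE in the real autoscaler.
    static final String TARGET_DATA_RATE = "TARGET_DATA_RATE";

    /** Stand-in for an external prediction service (e.g. queried over HTTP/RPC). */
    interface SpikePredictor {
        /** Expected load multiplier for the upcoming window, e.g. 2.0 on Black Friday. */
        double predictedMultiplier();
    }

    /** Adjust only the evaluated target data rate; leave all other metrics as-is. */
    static Map<String, Double> evaluate(Map<String, Double> evaluated, SpikePredictor predictor) {
        Map<String, Double> out = new HashMap<>(evaluated);
        double multiplier = predictor.predictedMultiplier();
        out.computeIfPresent(TARGET_DATA_RATE, (k, rate) -> rate * multiplier);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new HashMap<>();
        metrics.put(TARGET_DATA_RATE, 1000.0);

        // Pretend the external system predicts a 2x spike today.
        Map<String, Double> adjusted = evaluate(metrics, () -> 2.0);
        System.out.println(adjusted.get(TARGET_DATA_RATE)); // prints 2000.0
    }
}
```

Since the target data rate drives the scaling decision, a single hook like this is enough; no new collected metrics are needed.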
> Opening these interfaces as plugins could allow developers to write incompatible solutions that are hard to debug and maintain. My original intention is to support our internal requirements and make sure these internal implementations align with upstream. Let's take a look at the original problems we want to resolve; I will cover why these extensions can resolve them without breaking the core logic.
>
> *1. Predictive Auto Scaling Up*
>
> It is common for traffic spikes to happen on some special days of the year, for example Black Friday and the Super Bowl. From our observations, some topics will have 2x or 3x traffic within several hours. In this case, several scale-ups could happen due to the fast-growing traffic pattern, which can easily cause SLA violations or outages for large-state Flink jobs.
>
> Similar to delayed scale down, we want to use an additional signal (the predicted spike traffic of the day) to trigger the scale-up early. Together with a 24-hour delayed scale down, we want to handle the situation seamlessly without changing autoscaling behavior on regular days.
>
> With a pluggable Metrics Collector, Metrics Evaluator, and Scaling Realizer, we can override the existing ScalingMetricCollector <https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricCollector.java> to get peak traffic from external systems (through HTTP/RPC), normalize the signal as metrics in the Metrics Evaluator, and finally use it for the parallelism adjustment in the Scaling Realizer.
>
> *2. Data Size Aware Auto Scaling*
>
> We are also working on streaming ingestion in our org. There is an additional requirement beyond the streaming analytics use cases: keeping the file size in a reasonable range for query performance after autoscaling.
> The way to achieve this is to tune the checkpoint interval together with the parallelism, using additional metrics from the Hudi/Iceberg connector.
>
> If we know the total file size of each commit for the last hours or even longer (given a checkpoint interval), we may roughly estimate each file size = estimated target total file size / target parallelism after a rescaling action. To achieve this goal, we also need to add additional metrics to the existing Metrics Collector and use them in the Scaling Realizer to change the checkpoint interval after the parallelism change.
>
> Honestly, I think these two requirements are common enough to become upstream features. The proposed pluggable components are one way to implement them; they could also be achieved through configuration-based changes to existing classes, similar to delayed scale down. Looking forward to your feedback.
>
> Best Regards
> Peter Huang
>
> On Tue, Aug 26, 2025 at 1:24 AM Maximilian Michels <m...@apache.org> wrote:
>
> > Hi Peter,
> >
> > First of all, this is a great initiative. Flink autoscaling definitely needs more points of extension. We recently added support for hooking into the metric evaluation (FLIP-514), but clearly that is just one extension point.
> >
> > That said, I think we will need to revise the approach a bit. I'm not sure we should be replacing core components. As Gyula mentioned, replacing those will easily break the entire autoscaler. Instead, we should be adding extension points which allow for meaningful additions without breaking the scaling logic. There is already the option to replace the entire autoscaling module, if users really want to roll out a completely custom version.
> >
> > What usually works best is to formulate the use case first, then figure out what autoscaler customization would be necessary to implement the use case.
> > As for making the ScalingRealizer pluggable (https://github.com/apache/flink-kubernetes-operator/pull/1020/files), I do think that makes sense for some scenarios.
> >
> > Cheers,
> > Max
> >
> > On Tue, Aug 26, 2025 at 8:59 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
> > >
> > > Hi Peter & Diljeet!
> > >
> > > My general feedback is that we should try to introduce extension plugins instead of plugins that completely replace key parts of the autoscaler code.
> > >
> > > Let me give you a concrete example through FLIP-514 and FLIP-543 using the MetricsEvaluator pluggability. The MetricsEvaluator in the autoscaler is responsible for evaluating/deriving/calculating metrics from the collected metrics. It has to calculate everything in a more or less specific way, otherwise other parts of the autoscaler that depend on these metrics may not work. It doesn't seem very practical/reasonable to completely reimplement this just because someone wants to extend the logic; that is extremely error prone and fragile, especially if the autoscaler logic later evolves.
> > >
> > > FLIP-514 takes the approach of extending the metric evaluator with a new method that allows users to modify the evaluated metrics at the end and define custom ones. This is the right approach here, as it makes a new extension very simple to build and maintain without interfering with existing logic.
> > >
> > > The approach in FLIP-543 and in Diljeet's example PR is the replacement approach: completely substituting entire parts of the implementation (the entire evaluator, scaling realizer, etc.). I think this is not very good for either the community or the actual user.
> > > From a community perspective it makes it harder to extend the logic with nice small additions, and from a user's perspective it is very error prone if the operator autoscaler logic changes, as it basically exposes a lot of internal logic on a user interface.
> > >
> > > So at this point, -1 for the approach in FLIP-543 from my side, but I would love to hear the opinion of others as well.
> > >
> > > Cheers
> > > Gyula
> > >
> > > On Mon, Aug 25, 2025 at 11:44 PM Peter Huang <huangzhenqiu0...@gmail.com> wrote:
> > >>
> > >> Hi Diljeet,
> > >>
> > >> Yes, I think we have similar requirements to make the autoscaler even more powerful and able to handle some customized requirements. The quick PoC makes sense to me. Let's get some more feedback from the community.
> > >>
> > >> Best Regards
> > >> Peter Huang
> > >>
> > >> On Mon, Aug 25, 2025 at 2:37 PM Peter Huang <huangzhenqiu0...@gmail.com> wrote:
> > >>
> > >> > Just trying to combine the discussion into one thread.
> > >> >
> > >> > @Diljeet Singh posted a quick PoC for the proposal: https://github.com/apache/flink-kubernetes-operator/pull/1020.
> > >> >
> > >> > On Mon, Aug 25, 2025 at 7:52 AM Peter Huang <huangzhenqiu0...@gmail.com> wrote:
> > >> >
> > >> >> Hi Community,
> > >> >>
> > >> >> Our org has been heavily using the Flink autoscaling algorithm. It has greatly reduced our operational overhead and improved cost efficiency, as users always over-provision resources when onboarding. Recently, we have had some requirements to customize the autoscaling algorithm for different scenarios, for example a large but predictable traffic spike during the holiday season, or increasing the checkpoint interval together with a scale-up for streaming ingestion use cases.
> > >> >>
> > >> >> We searched through the discussions about this topic on the mailing list, including the existing FLIP-514 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler>. It looks like that discussion has not been finalized yet. To accelerate the process, we adopted and combined the existing opinions from the community and created a proposal in FLIP-543 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm>. The basic idea is to make some core components of the autoscaler pluggable, for example the MetricsCollector, Metrics Evaluator, and ScalingRealizer, while at the same time keeping the core logic skeleton of the autoscaler (which is already well proven by a large number of users) untouched.
> > >> >>
> > >> >> Looking forward to any feedback and opinions on FLIP-543.
> > >> >>
> > >> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm
> > >> >> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler
> > >> >> [3] Other related discussion threads:
> > >> >> https://lists.apache.org/thread/749l74z1h5jylkxrw3rtjmxcj2t9p7ws
> > >> >> https://lists.apache.org/thread/mcd7jcn4kz6oqtyqq5hfycjf9mqh6c53
> > >> >>
> > >> >> Best Regards
> > >> >> Peter Huang