Hi Folks,

Thanks for these suggestions. I think we are aligned that these two features
are common and should be implemented upstream.
I'll try to summarize the action items (AIs) below. Please feel free to add
more if I missed anything.

1) Finish the planned work in FLIP-514 to support a pluggable
MetricsEvaluator, including the scheduled-scaling plugin as planned in
FLIP-514
2) Support Predictive Autoscaling as a configurable feature on top of a
customized MetricsEvaluator in FLIP-543
3) Support Data size aware autoscaling as a configurable feature on top
of a customized MetricsEvaluator in FLIP-543

I will revise FLIP-543 to focus mainly on how Predictive Autoscaling and
Data size aware autoscaling could be implemented on top of the pluggable
MetricsEvaluator.
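
A minimal sketch of how Predictive Autoscaling could sit on top of a
pluggable MetricsEvaluator. All names here (CustomEvaluator, Metric,
PredictiveEvaluator, forecastRate) are hypothetical and only illustrate the
idea. Only TARGET_DATA_RATE comes from this thread, and the real FLIP-514
interface may look different.

```java
import java.util.EnumMap;
import java.util.Map;

public class PredictiveEvaluatorSketch {

    /** Hypothetical metric keys; only TARGET_DATA_RATE is named in this thread. */
    public enum Metric { TARGET_DATA_RATE, TRUE_PROCESSING_RATE }

    /** Hypothetical post-evaluation hook standing in for the FLIP-514 extension point. */
    public interface CustomEvaluator {
        Map<Metric, Double> adjust(Map<Metric, Double> evaluated);
    }

    /** Raises TARGET_DATA_RATE to a forecast rate when the forecast is higher. */
    public static final class PredictiveEvaluator implements CustomEvaluator {
        private final double forecastRate; // assumed to come from some external forecast

        public PredictiveEvaluator(double forecastRate) {
            this.forecastRate = forecastRate;
        }

        @Override
        public Map<Metric, Double> adjust(Map<Metric, Double> evaluated) {
            // Copy so the metrics computed by the core evaluator stay untouched.
            Map<Metric, Double> result = new EnumMap<>(Metric.class);
            result.putAll(evaluated);
            // Keep the larger of the observed and the forecast rate.
            result.merge(Metric.TARGET_DATA_RATE, forecastRate, Double::max);
            return result;
        }
    }

    public static void main(String[] args) {
        Map<Metric, Double> evaluated = new EnumMap<>(Metric.class);
        evaluated.put(Metric.TARGET_DATA_RATE, 1000.0);
        evaluated.put(Metric.TRUE_PROCESSING_RATE, 400.0);

        CustomEvaluator evaluator = new PredictiveEvaluator(2500.0);
        Map<Metric, Double> adjusted = evaluator.adjust(evaluated);
        System.out.println(adjusted.get(Metric.TARGET_DATA_RATE));
    }
}
```

The hook changes a single evaluated metric and leaves the rest of the
scaling logic alone, which is the extension style discussed in this thread.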

Best Regards
Peter Huang

On Thu, Aug 28, 2025 at 2:18 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi everyone,
>
> Thanks for the productive conversation on FLIP-543.
>
> I agree that we need more extensibility in the autoscaler. The predictive
> scaling use case is a perfect example of a powerful feature that would help
> many of us improve job availability by scaling before backlogs build up.
>
> To echo Gyula and Max's points, I also believe the best path forward is to
> build this capability as an extension to the existing framework, not as a
> replacement. This would offer a robust, community-driven solution for a
> common problem, which feels more sustainable than asking users to implement
> and maintain custom forks of the logic.
>
> Best,
> Rui
>
> On Thu, Aug 28, 2025 at 7:14 AM Pradeepta Choudhury
> <pchoudhur...@apple.com.invalid> wrote:
>
> > Hello Peter,
> >
> > To start with, great initiative! But I echo the concern already raised:
> > creating too many extension points can compromise the autoscaler
> > functionality.
> > When we proposed FLIP-514 [1] and a custom evaluator, the aim was twofold:
> > provide the required extension point and ship practical strategies as
> > pluggables. At the same time, we wanted to preserve flexibility for
> > advanced, highly specific scenarios (like predictive scaling) that differ
> > by ecosystem, platform, and company. Our thinking was that the custom
> > evaluator strikes that balance: it lets users adjust the evaluated
> > metrics, especially TARGET_DATA_RATE, that drive the scale-factor
> > calculation, enabling useful out-of-the-box behavior without constraining
> > bespoke implementations.
> > One of the desired outcomes we had set for FLIP-514 was to ship a
> > scheduled-scaling strategy as a pluggable, leveraging a baseline period
> > and explicit scheduled windows to drive planned capacity changes. I’ve
> > been away since last month due to personal commitments. I plan to resume
> > after the first week of September and will complete the scheduled-scaling
> > plugin to wrap up the custom evaluator.
> > Making the ScalingRealizer pluggable
> > (https://github.com/apache/flink-kubernetes-operator/pull/1020/files)
> > definitely sounds helpful for certain scenarios.
> > But I totally agree with the general approach suggested by Gyula: solve
> > specific issues independently in the "best possible way", and then arrive
> > at a good solution for pluggability that could be a foundation for future
> > use cases.
> >
> >
> > Thanks and Regards
> > Pradeepta
> >
> >
> > > On 26 Aug 2025, at 6:05 PM, ctrlaltd...@icloud.com.invalid <
> > ctrlaltd...@icloud.com.INVALID> wrote:
> > >
> > > For the ScalingRealizer, I think having before/after hooks for
> > > `realizeParallelismOverrides` and `realizeConfigOverrides` would be good.
> > > We could support these hooks via plugins. Thoughts?
> > >
> > >
> > > Best,
> > > Diljeet(DJ) Singh
> > >
> > > On 2025/08/26 08:24:33 Maximilian Michels wrote:
> > >> Hi Peter,
> > >>
> > >> First of all, this is a great initiative. Flink Autoscaling definitely
> > >> needs more points of extension. We recently added support for hooking
> > >> into the metric evaluation (FLIP-514), but clearly that is just one
> > >> extension point.
> > >>
> > >> That said, I think we will need to revise the approach a bit. I'm not
> > >> sure we should be replacing core components. As Gyula mentioned,
> > >> replacing those will easily break the entire autoscaler. Instead, we
> > >> should be adding extension points which allow for meaningful additions
> > >> without breaking the scaling logic. There is already the option to
> > >> replace the entire autoscaling module, if users really want to roll
> > >> out a completely custom version.
> > >>
> > >> What usually works best is to formulate the use case first, then
> > >> figure out what autoscaler customization would be necessary to
> > >> implement the use case.
> > >>
> > >> As for making the ScalingRealizer pluggable
> > >> (https://github.com/apache/flink-kubernetes-operator/pull/1020/files),
> > >> I do think that makes sense for some scenarios.
> > >>
> > >> Cheers,
> > >> Max
> > >>
> > >> On Tue, Aug 26, 2025 at 8:59 AM Gyula Fóra <gy...@gmail.com> wrote:
> > >>>
> > >>> Hi Peter & Diljeet!
> > >>>
> > >>> My general feedback is that we should try to introduce extension
> > >>> plugins instead of plugins that completely replace key parts of the
> > >>> autoscaler code.
> > >>>
> > >>> Let me give you a concrete example through FLIP-514 and FLIP-543, using
> > >>> the MetricsEvaluator pluggability.
> > >>> The MetricsEvaluator in the autoscaler is responsible for
> > >>> evaluating/deriving/calculating metrics from the collected metrics. It
> > >>> has to calculate everything in a more or less specific way, otherwise
> > >>> other parts of the autoscaler that depend on these metrics may not work.
> > >>> It doesn't seem very practical or reasonable to completely reimplement
> > >>> this just because someone wants to extend the logic. That is extremely
> > >>> error-prone and fragile, especially if the autoscaler logic later evolves.
> > >>>
> > >>> FLIP-514 takes the approach of extending the metric evaluator with a new
> > >>> method that lets users modify the evaluated metrics at the end and define
> > >>> custom ones. This is the right approach here, as it makes a new extension
> > >>> very simple to build and maintain without interfering with existing logic.
> > >>>
> > >>> FLIP-543 and Diljeet's example PR take the replacement approach, completely
> > >>> substituting entire parts of the implementation (the entire evaluator,
> > >>> scaling realizer, etc.). I think this is not very good for either the
> > >>> community or the actual user. From a community perspective, it makes it
> > >>> harder to extend the logic with nice small additions. From a user's
> > >>> perspective, it is very error-prone if the operator autoscaler logic
> > >>> changes, as it basically exposes a lot of internal logic on a user
> > >>> interface.
> > >>>
> > >>> So at this point, -1 for the approach in FLIP-543 from my side, but I
> > >>> would love to hear the opinion of others as well.
> > >>>
> > >>> Cheers
> > >>> Gyula
> > >>>
> > >>> On Mon, Aug 25, 2025 at 11:44 PM Peter Huang <hu...@gmail.com>
> wrote:
> > >>>>
> > >>>> Hi Diljeet,
> > >>>>
> > >>>> Yes, I think we have similar requirements: making the autoscaler even
> > >>>> more powerful so it can handle some customized requirements.
> > >>>> The quick PoC makes sense to me. Let's get some more feedback from the
> > >>>> community.
> > >>>>
> > >>>>
> > >>>>
> > >>>> Best Regards
> > >>>> Peter Huang
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Aug 25, 2025 at 2:37 PM Peter Huang <hu...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Just trying to combine the discussions into one thread.
> > >>>>>
> > >>>>> @Diljeet Singh posted a quick PoC for the proposal:
> > >>>>> https://github.com/apache/flink-kubernetes-operator/pull/1020.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Mon, Aug 25, 2025 at 7:52 AM Peter Huang <hu...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Community,
> > >>>>>>
> > >>>>>> Our org has been heavily using the Flink autoscaling algorithm. It
> > >>>>>> has greatly reduced our operational overhead and improved cost
> > >>>>>> efficiency, as users always over-provision resources when
> > >>>>>> onboarding. Recently, we have had some requirements to customize the
> > >>>>>> autoscaling algorithm for different scenarios: for example, handling
> > >>>>>> large but predictable traffic spikes during the holiday season, or
> > >>>>>> increasing the checkpoint interval together with scaling up for
> > >>>>>> streaming ingestion use cases.
> > >>>>>>
> > >>>>>> We searched through the discussions on this topic on the mailing
> > >>>>>> list, including the existing FLIP-514 [2]. It looks like that
> > >>>>>> discussion is not finalized yet. To accelerate the process, we
> > >>>>>> adopted and combined the existing opinions from the community and
> > >>>>>> created a proposal, FLIP-543 [1]. The basic idea is to make some core
> > >>>>>> components of the autoscaler pluggable, for example the
> > >>>>>> MetricsCollector, MetricsEvaluator, and ScalingRealizer, while
> > >>>>>> keeping the core logic skeleton of the autoscaler (which is already
> > >>>>>> well validated by a large number of users) untouched.
> > >>>>>>
> > >>>>>> Looking forward to any feedback and opinions on FLIP-543.
> > >>>>>>
> > >>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-543%3A+Support+Customized+Autoscale+Algorithm
> > >>>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-514%3A+Custom+Evaluator+plugin+for+Flink+Autoscaler
> > >>>>>> [3] Other related discussion threads:
> > >>>>>> https://lists.apache.org/thread/749l74z1h5jylkxrw3rtjmxcj2t9p7ws
> > >>>>>> https://lists.apache.org/thread/mcd7jcn4kz6oqtyqq5hfycjf9mqh6c53
> > >>>>>>
> > >>>>>>
> > >>>>>> Best Regards
> > >>>>>> Peter Huang
> > >>>>>>
> > >>>>>
> >
> >
>
