Re: [DISCUSS] FLIP-271: Autoscaling

Yanfei Lei Sun, 06 Nov 2022 20:20:38 -0800

Hi Max,

Thanks for the proposal. This proposal makes Flink better adapted to
cloud-native applications!


After reading the FLIP, I'm curious about some points:

1) The FLIP says "The first step is collecting metrics for all JobVertices
by combining metrics from all the runtime subtasks and computing the
*average*". When the load is not balanced across the subtasks of an
operator, should autoscaling be triggered? Have the median or some
percentiles been considered?
2) IIUC, "FLIP-159: Reactive Mode" is somewhat similar to this proposal.
Will we reuse some logic from Reactive Mode?
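
To make the concern in 1) concrete, here is a minimal sketch (not from the
FLIP) comparing the average with the median and the max for skewed subtasks.
The busyTimeMsPerSecond values and the function name summarize_busy_time are
made up for illustration.

```python
import statistics


def summarize_busy_time(busy_ms_per_subtask):
    """Return mean, median, and max of per-subtask busy time (ms per second).

    Hypothetical helper for illustration only; it is not part of the FLIP
    or of any Flink API.
    """
    return {
        "mean": statistics.mean(busy_ms_per_subtask),
        "median": statistics.median(busy_ms_per_subtask),
        "max": max(busy_ms_per_subtask),
    }


# Made-up values: three lightly loaded subtasks and one saturated subtask.
skewed = [200, 200, 200, 1000]
summary = summarize_busy_time(skewed)
print(summary)
```

With these values the average is 400 ms/s even though one subtask is fully
busy at 1000 ms/s, so an average alone could hide the bottleneck.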

Best,
Yanfei

On Mon, Nov 7, 2022 at 02:33, Gyula Fóra <[email protected]> wrote:

> Hi Dong!
>
> Let me try to answer the questions :)
>
> 1: busyTimeMsPerSecond is not specific to CPU. If I understand correctly,
> it measures the time an operator spends in the main record processing
> loop. This includes IO operations too.
>
> 2: I agree, we should add this to the FLIP. It would be a Duration config
> with the expected catch up time after rescaling (let's say 5 minutes). It
> could be computed based on the current data rate and the calculated max
> processing rate after the rescale.
>
> 3: In the current proposal we don't have per operator configs. Target
> utilization would apply to all operators uniformly.
>
> 4: It should be configurable, yes.
>
> 5,6: The names haven't been finalized but I think these are minor details.
> We could add concrete names to the FLIP :)
>
> Cheers,
> Gyula
>
>
> On Sun, Nov 6, 2022 at 5:19 PM Dong Lin <[email protected]> wrote:
>
> > Hi Max,
> >
> > Thank you for the proposal. The proposal tackles a very important issue
> > for Flink users and the design looks promising overall!
> >
> > I have some questions to better understand the proposed public interfaces
> > and the algorithm.
> >
> > 1) The proposal seems to assume that the operator's busyTimeMsPerSecond
> > could reach 1 sec. I believe this is mostly true for cpu-bound operators.
> > Could you confirm that this can also be true for io-bound operators such
> > as sinks? For example, suppose a Kafka Sink subtask has reached I/O
> > bottleneck when flushing data out to the Kafka clusters, will
> > busyTimeMsPerSecond reach 1 sec?
> >
> > 2) It is said that "users can configure a maximum time to fully process
> > the backlog". The configuration section does not seem to provide this
> > config. Could you specify this? And any chance this proposal can provide
> > the formula for calculating the new processing rate?
> >
> > 3) How are users expected to specify the per-operator configs (e.g.
> > target utilization)? For example, should users specify it
> > programmatically in a DataStream/Table/SQL API?
> >
> > 4) How often will the Flink Kubernetes operator query metrics from
> > JobManager? Is this configurable?
> >
> > 5) Could you specify the config name and default value for the proposed
> > configs?
> >
> > 6) Could you add the name/mbean/type for the proposed metrics?
> >
> >
> > Cheers,
> > Dong
> >
> >
> >
>
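
The catch-up time from Gyula's point 2 above could be turned into a target
processing rate roughly as follows. This is only a sketch under an assumed
formula (keep up with the current input rate and also drain the backlog
within the configured duration). The thread does not give the formula, and
the function and parameter names are hypothetical.

```python
def required_processing_rate(input_rate, backlog_records, catch_up_seconds):
    """Records/second needed to keep up with input and drain the backlog.

    Assumed formula (not confirmed by the FLIP or this thread):
        rate = input_rate + backlog_records / catch_up_seconds
    """
    if catch_up_seconds <= 0:
        raise ValueError("catch_up_seconds must be positive")
    return input_rate + backlog_records / catch_up_seconds


# Made-up numbers: 1000 records/s arriving, 600000 records of backlog,
# and a catch-up duration of 5 minutes (300 seconds).
target_rate = required_processing_rate(1000, 600_000, 300)
print(target_rate)
```

Under this assumption, a shorter catch-up duration raises the target rate
and therefore the parallelism needed after the rescale.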
