Re: [DISCUSS] FLIP-271: Autoscaling

JunRui Lee Mon, 07 Nov 2022 01:25:43 -0800

@Gyula,

Thanks for the explanation and the follow-up actions. That sounds good to
me.


Thanks,
JunRui Lee

On Mon, Nov 7, 2022 at 12:20 PM Yanfei Lei <[email protected]> wrote:

> Hi Max,
>
> Thanks for the proposal. This proposal makes Flink better adapted to
> cloud-native applications!
>
> After reading the FLIP, I'm curious about some points:
>
> 1) It's said that "The first step is collecting metrics for all JobVertices
> by combining metrics from all the runtime subtasks and computing the
> *average*". When load is unbalanced across an operator's subtasks,
> do we need to trigger autoscaling? Has the median or some percentile been
> considered?
> 2) IIUC, "FLIP-159: Reactive Mode" is somewhat similar to this proposal.
> Will we reuse some logic from Reactive Mode?
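
A minimal sketch of why the aggregation choice raised in 1) matters when
subtask load is skewed. The helper name and the sample values are
hypothetical, and the percentile uses the nearest-rank method; this is an
illustration only, not the aggregation code proposed in the FLIP.

```python
import math
import statistics


def summarize_busy_time(busy_ms_per_subtask, percentile=90):
    """Summarize per-subtask busyTimeMsPerSecond samples (hypothetical helper)."""
    values = sorted(busy_ms_per_subtask)
    if not values:
        raise ValueError("need at least one subtask sample")
    # Nearest-rank percentile: the smallest sample such that at least
    # `percentile` percent of all samples are at or below it.
    rank = max(1, math.ceil(percentile * len(values) / 100))
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "percentile": values[rank - 1],
        "max": values[-1],
    }


# Three lightly loaded subtasks and one hot subtask (made-up numbers).
skewed = summarize_busy_time([100, 100, 100, 900])
```

With these made-up samples the mean is 300 ms/s and the median is 100 ms/s,
while one subtask is busy for 900 ms/s, so an average alone could understate
a bottleneck on the hot subtask.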
>
> Best,
> Yanfei
>
> On Mon, Nov 7, 2022 at 2:33 AM Gyula Fóra <[email protected]> wrote:
>
> > Hi Dong!
> >
> > Let me try to answer the questions :)
> >
> > 1: busyTimeMsPerSecond is not specific to CPU. If I understand correctly,
> > it measures the time spent in the main record processing loop for an
> > operator, which includes IO operations too.
> >
> > 2: I agree, we should add this to the FLIP. It would be a Duration config
> > with the expected catch up time after rescaling (let's say 5 minutes). It
> > could be computed based on the current data rate and the calculated max
> > processing rate after the rescale.
> >
> > 3: In the current proposal we don't have per operator configs. Target
> > utilization would apply to all operators uniformly.
> >
> > 4: It should be configurable, yes.
> >
> > 5,6: The names haven't been finalized but I think these are minor
> > details. We could add concrete names to the FLIP :)
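
A rough sketch of the relationship described in 2), assuming the backlog
drains at the difference between the post-rescale max processing rate and
the current input rate. The function and parameter names are hypothetical,
and this formula is an assumption for illustration, not one confirmed by
the FLIP.

```python
def catch_up_time_seconds(backlog_records, input_rate, max_processing_rate):
    """Estimated time to drain the backlog after rescaling (assumed model).

    Rates are in records per second. If the job cannot process faster than
    data arrives, the backlog never drains and infinity is returned.
    """
    spare_rate = max_processing_rate - input_rate
    if spare_rate <= 0:
        return float("inf")
    return backlog_records / spare_rate


def required_processing_rate(backlog_records, input_rate, catch_up_seconds):
    """Processing rate needed to drain the backlog within the configured time."""
    if catch_up_seconds <= 0:
        raise ValueError("catch_up_seconds must be positive")
    return input_rate + backlog_records / catch_up_seconds


# Made-up numbers: 300,000 buffered records, 1,000 records/s arriving,
# and a 5 minute (300 s) catch-up target.
needed = required_processing_rate(300_000, 1_000, 300)
drain = catch_up_time_seconds(300_000, 1_000, needed)
```

Under this model, a 300,000-record backlog with 1,000 records/s arriving
needs about 2,000 records/s of processing capacity to catch up in 5 minutes.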
> >
> > Cheers,
> > Gyula
> >
> >
> > On Sun, Nov 6, 2022 at 5:19 PM Dong Lin <[email protected]> wrote:
> >
> > > Hi Max,
> > >
> > > Thank you for the proposal. The proposal tackles a very important issue
> > > for Flink users and the design looks promising overall!
> > >
> > > I have some questions to better understand the proposed public
> > > interfaces and the algorithm.
> > >
> > > 1) The proposal seems to assume that the operator's busyTimeMsPerSecond
> > > could reach 1 sec. I believe this is mostly true for CPU-bound
> > > operators. Could you confirm that this can also be true for IO-bound
> > > operators such as sinks? For example, suppose a Kafka Sink subtask has
> > > reached an I/O bottleneck when flushing data out to the Kafka cluster.
> > > Will busyTimeMsPerSecond reach 1 sec?
> > >
> > > 2) It is said that "users can configure a maximum time to fully process
> > > the backlog". The configuration section does not seem to provide this
> > > config. Could you specify this? And any chance this proposal can
> > > provide the formula for calculating the new processing rate?
> > >
> > > 3) How are users expected to specify the per-operator configs (e.g.
> > > target utilization)? For example, should users specify it
> > > programmatically in a DataStream/Table/SQL API?
> > >
> > > 4) How often will the Flink Kubernetes operator query metrics from
> > > JobManager? Is this configurable?
> > >
> > > 5) Could you specify the config name and default value for the proposed
> > > configs?
> > >
> > > 6) Could you add the name/mbean/type for the proposed metrics?
> > >
> > >
> > > Cheers,
> > > Dong
> > >
> > >
> > >
> >
>
