@Gyula, thanks for the explanation and the follow-up actions. That sounds good to me.
Thanks,
JunRui Lee

Yanfei Lei <fredia...@gmail.com> wrote on Monday, November 7, 2022, at 12:20:

> Hi Max,
>
> Thanks for the proposal. This proposal makes Flink better adapted to
> cloud-native applications!
>
> After reading the FLIP, I'm curious about some points:
>
> 1) It's said that "The first step is collecting metrics for all JobVertices
> by combining metrics from all the runtime subtasks and computing the
> *average*". When the load of the subtasks of an operator is not balanced,
> do we need to trigger autoscaling? Has the median or some percentile been
> considered?
> 2) IIUC, "FLIP-159: Reactive Mode" is somewhat similar to this proposal;
> will we reuse some logic from Reactive Mode?
>
> Best,
> Yanfei
>
> Gyula Fóra <gyula.f...@gmail.com> wrote on Monday, November 7, 2022, at 02:33:
>
> > Hi Dong!
> >
> > Let me try to answer the questions :)
> >
> > 1: busyTimeMsPerSecond is not specific to CPU. If I understand correctly,
> > it measures the time spent in the main record processing loop for an
> > operator, which includes IO operations too.
> >
> > 2: I agree we should add this to the FLIP. It would be a Duration config
> > with the expected catch-up time after rescaling (let's say 5 minutes). It
> > could be computed based on the current data rate and the calculated max
> > processing rate after the rescale.
> >
> > 3: In the current proposal we don't have per-operator configs. Target
> > utilization would apply to all operators uniformly.
> >
> > 4: It should be configurable, yes.
> >
> > 5, 6: The names haven't been finalized, but I think these are minor
> > details. We could add concrete names to the FLIP :)
> >
> > Cheers,
> > Gyula
> >
> > On Sun, Nov 6, 2022 at 5:19 PM Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hi Max,
> > >
> > > Thank you for the proposal. The proposal tackles a very important issue
> > > for Flink users and the design looks promising overall!
> > >
> > > I have some questions to better understand the proposed public
> > > interfaces and the algorithm.
> > >
> > > 1) The proposal seems to assume that the operator's busyTimeMsPerSecond
> > > could reach 1 sec. I believe this is mostly true for CPU-bound operators.
> > > Could you confirm that this can also be true for IO-bound operators such
> > > as sinks? For example, suppose a Kafka Sink subtask has reached an I/O
> > > bottleneck when flushing data out to the Kafka clusters; will
> > > busyTimeMsPerSecond reach 1 sec?
> > >
> > > 2) It is said that "users can configure a maximum time to fully process
> > > the backlog". The configuration section does not seem to provide this
> > > config. Could you specify it? And could the proposal also provide the
> > > formula for calculating the new processing rate?
> > >
> > > 3) How are users expected to specify the per-operator configs (e.g.
> > > target utilization)? For example, should users specify them
> > > programmatically in a DataStream/Table/SQL API?
> > >
> > > 4) How often will the Flink Kubernetes operator query metrics from the
> > > JobManager? Is this configurable?
> > >
> > > 5) Could you specify the config name and default value for the proposed
> > > configs?
> > >
> > > 6) Could you add the name/mbean/type for the proposed metrics?
> > >
> > > Cheers,
> > > Dong
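For readers following Yanfei's question 1 and Dong's question 2, here is a minimal Python sketch of the arithmetic being discussed. It is an illustration under stated assumptions, not the FLIP's implementation or any Flink API; all function names are hypothetical. It assumes: (a) a subtask's "true" processing rate is its observed records/s divided by the busy fraction derived from busyTimeMsPerSecond; (b) the target rate is the current input rate plus the backlog divided by the configured catch-up duration (Gyula's 5-minute example); (c) the expected catch-up time is the backlog divided by the surplus of the max processing rate after rescaling over the current data rate, following Gyula's description. The input numbers are illustrative, not measurements.

```python
import math
import statistics


def aggregate(per_subtask, method="average"):
    """Hypothetical helper: combine one metric across all subtasks of a JobVertex."""
    if method == "average":
        return statistics.mean(per_subtask)
    if method == "median":
        return statistics.median(per_subtask)
    if method == "p90":
        # Last of the 9 decile cut points = 90th percentile (needs >= 2 values).
        return statistics.quantiles(per_subtask, n=10, method="inclusive")[-1]
    raise ValueError(f"unknown aggregation: {method}")


def true_processing_rate(records_per_sec, busy_time_ms_per_sec):
    """Assumption: records/s a subtask could sustain if it were busy 100% of the time."""
    busy_fraction = busy_time_ms_per_sec / 1000.0
    if busy_fraction <= 0:
        raise ValueError("subtask reported no busy time")
    return records_per_sec / busy_fraction


def required_parallelism(input_rate, backlog, catch_up_sec,
                         rate_per_subtask, target_utilization):
    """Hypothetical: smallest parallelism that keeps up with the input and drains
    the backlog within catch_up_sec while each subtask runs at target_utilization."""
    target_rate = input_rate + backlog / catch_up_sec
    usable_rate_per_subtask = rate_per_subtask * target_utilization
    return math.ceil(target_rate / usable_rate_per_subtask)


def expected_catch_up_sec(backlog, input_rate, max_rate_after_rescale):
    """Time to drain the backlog after rescaling; None if it would never drain."""
    surplus = max_rate_after_rescale - input_rate
    return backlog / surplus if surplus > 0 else None


if __name__ == "__main__":
    busy_ms = [900, 300, 300, 300]  # one hot subtask, three lightly loaded
    records = [450, 150, 150, 150]  # records/s observed per subtask
    for m in ("average", "median", "p90"):
        print(m, aggregate(busy_ms, m))

    rates = [true_processing_rate(r, b) for r, b in zip(records, busy_ms)]
    rate = aggregate(rates)
    # Illustrative inputs: 3000 rec/s, 600k-record backlog, 5-minute catch-up, 80% target.
    p = required_parallelism(3000, 600_000, 300, rate, 0.8)
    print("parallelism", p)
    print("catch-up seconds", expected_catch_up_sec(600_000, 3000, p * rate))
```

With busy times of 900, 300, 300, and 300 ms/s, the average (450) and the median (300) both mask the one nearly saturated subtask, which is the skew concern Yanfei raises; the 90th percentile (720) surfaces it. In this sketch the catch-up estimate uses the full max rate after rescaling, so it comes out shorter than the configured 5 minutes because the parallelism was sized for 80% utilization.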