Hi Max,

Thanks for the proposal. This proposal makes Flink better adapted to cloud-native applications!
After reading the FLIP, I'm curious about some points:

1) It's said that "The first step is collecting metrics for all JobVertices by combining metrics from all the runtime subtasks and computing the *average*". When the load of the subtasks of an operator is not balanced, do we need to trigger autoscaling? Has the median or some percentiles been considered?

2) IIUC, "FLIP-159: Reactive Mode" is somewhat similar to this proposal. Will we reuse some logic from Reactive Mode?

Best,
Yanfei

Gyula Fóra <gyula.f...@gmail.com> wrote on Mon, Nov 7, 2022 at 02:33:

> Hi Dong!
>
> Let me try to answer the questions :)
>
> 1: busyTimeMsPerSecond is not specific for CPU, it measures the time spent
> in the main record processing loop for an operator if I
> understand correctly. This includes IO operations too.
>
> 2: We should add this to the FLIP I agree. It would be a Duration config
> with the expected catch up time after rescaling (let's say 5 minutes). It
> could be computed based on the current data rate and the calculated max
> processing rate after the rescale.
>
> 3: In the current proposal we don't have per operator configs. Target
> utilization would apply to all operators uniformly.
>
> 4: It should be configurable, yes.
>
> 5,6: The names haven't been finalized but I think these are minor details.
> We could add concrete names to the FLIP :)
>
> Cheers,
> Gyula
>
> On Sun, Nov 6, 2022 at 5:19 PM Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi Max,
> >
> > Thank you for the proposal. The proposal tackles a very important issue
> > for Flink users and the design looks promising overall!
> >
> > I have some questions to better understand the proposed public interfaces
> > and the algorithm.
> >
> > 1) The proposal seems to assume that the operator's busyTimeMsPerSecond
> > could reach 1 sec. I believe this is mostly true for cpu-bound operators.
> > Could you confirm that this can also be true for io-bound operators such as
> > sinks? For example, suppose a Kafka Sink subtask has reached I/O bottleneck
> > when flushing data out to the Kafka clusters, will busyTimeMsPerSecond
> > reach 1 sec?
> >
> > 2) It is said that "users can configure a maximum time to fully process
> > the backlog". The configuration section does not seem to provide this
> > config. Could you specify this? And any chance this proposal can provide
> > the formula for calculating the new processing rate?
> >
> > 3) How are users expected to specify the per-operator configs (e.g. target
> > utilization)? For example, should users specify it programmatically in a
> > DataStream/Table/SQL API?
> >
> > 4) How often will the Flink Kubernetes operator query metrics from
> > JobManager? Is this configurable?
> >
> > 5) Could you specify the config name and default value for the proposed
> > configs?
> >
> > 6) Could you add the name/mbean/type for the proposed metrics?
> >
> > Cheers,
> > Dong
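
To make my question 1) about the average vs. median/percentiles more concrete, here is a small sketch in plain Python (not Flink code). It assumes the per-subtask busyTimeMsPerSecond values of one JobVertex are available as a list. The helper name `summarize_busy_time` and the sample numbers are made up for illustration only.

```python
from statistics import mean, median, quantiles


def summarize_busy_time(busy_ms_per_subtask):
    """Aggregate per-subtask busy time (ms per second) in several ways."""
    # quantiles(n=10) returns the 9 decile cut points; the last one is p90.
    p90 = quantiles(busy_ms_per_subtask, n=10, method="inclusive")[-1]
    return {
        "mean": mean(busy_ms_per_subtask),
        "median": median(busy_ms_per_subtask),
        "p90": p90,
        "max": max(busy_ms_per_subtask),
    }


# Hypothetical vertex with 8 subtasks; one subtask is saturated (e.g. key skew).
skewed = [200, 210, 190, 205, 195, 200, 200, 1000]
stats = summarize_busy_time(skewed)
print(stats)
```

With these made-up numbers the average is 300 ms/s and the median is 200 ms/s, while one subtask is busy for the full 1000 ms/s. Looking only at the average (or the median), the vertex appears mostly idle even though one subtask is the bottleneck, which is why I wonder whether a higher percentile or the max should also be taken into account.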