Hey!

Thanks for the input!

The algorithm does not really differentiate between scaling up and down:
it is concerned with finding the right parallelism to match the target
processing rate while leaving just enough spare capacity.
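To make that concrete, the core computation has roughly this shape (a
minimal Java sketch; the method and parameter names are illustrative
assumptions, not the FLIP's actual API):

    // Illustrative sketch only: derive the parallelism needed to sustain
    // a target rate while leaving some spare capacity.
    public final class ParallelismSketch {

        static int requiredParallelism(
                double targetRate,          // records/s the vertex must handle
                double ratePerSubtask,      // observed records/s of one fully busy subtask
                double targetUtilization) { // e.g. 0.8 leaves 20% spare capacity
            return (int) Math.ceil(targetRate / (ratePerSubtask * targetUtilization));
        }

        public static void main(String[] args) {
            // 1000 rec/s target, 150 rec/s per subtask, 80% utilization -> 9
            System.out.println(requiredParallelism(1000, 150, 0.8));
        }
    }

The same formula covers both directions: a lower target rate simply yields
a lower required parallelism.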

Let me try to address your specific points:

1. The backlog growth rate only matters for computing the target processing
rate for the sources. If the parallelism is high enough and there is no
backpressure, it will be close to 0, so the target rate is simply the source
read rate. This is as intended: if we see that the sources are not busy and
can read more than enough, the algorithm will scale them down.

2. You are right, it's dangerous to scale in too much, so we already
thought about limiting the scale-down amount per scaling step/time window
to give more safety (see the sketch after this list). But we can definitely
think about different strategies in the future!
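
To illustrate both points in code (a rough sketch; the names and the
maxScaleDownFactor knob are assumptions, not the actual implementation):

    // Hypothetical sketch for points 1 and 2 above.
    public final class ScalingSketch {

        // Point 1: with no backpressure the backlog growth rate is ~0,
        // so the source target rate collapses to the plain read rate.
        static double sourceTargetRate(double readRate, double backlogGrowthRate) {
            return readRate + backlogGrowthRate;
        }

        // Point 2: cap how much a single scaling step may scale down,
        // e.g. maxScaleDownFactor = 0.5 never removes more than half
        // of the current subtasks in one step.
        static int clampScaleDown(int current, int proposed, double maxScaleDownFactor) {
            int floor = (int) Math.ceil(current * (1.0 - maxScaleDownFactor));
            return Math.max(proposed, floor);
        }

        public static void main(String[] args) {
            System.out.println(sourceTargetRate(1000.0, 0.0)); // 1000.0
            System.out.println(clampScaleDown(16, 2, 0.5));    // 8, not 2
        }
    }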

The observation regarding max parallelism is very important and we should
always take that into consideration.
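
In code terms that is just an upper bound on whatever parallelism the
algorithm proposes (again only a sketch, assuming the max parallelism is
read from the checkpoint metadata):

    public final class MaxParallelismBound {
        // Sketch: never exceed the max parallelism recorded in the
        // checkpoint, otherwise restoring state would fail (see the
        // Checkpoints.java link in the quoted mail below).
        static int bound(int proposed, int maxParallelism) {
            return Math.min(Math.max(proposed, 1), maxParallelism);
        }
    }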

Cheers
Gyula

On Sat, 5 Nov 2022 at 11:46, Biao Geng <biaoge...@gmail.com> wrote:

> Hi Max,
>
> Thanks a lot for the FLIP. It is an extremely attractive feature!
>
> Just some follow-up questions/thoughts after reading the FLIP:
> In the doc, the discussion of the “scaling out” strategy is thorough
> and convincing to me, but “scaling down” seems less discussed. I
> have 2 cents on this aspect:
>
>   1.  For source parallelisms, if the user configures a much larger value
> than normal, there should be very few pending records, even though there is
> still room for optimization. But IIUC, in the current algorithm, we will not
> take action in this case since the backlog growth rate is almost zero. Is
> this understanding right?
>   2.  Compared with “scaling out”, “scaling in” is usually more dangerous,
> as it is more likely to have a negative influence on downstream jobs.
> The min/max load bounds should be useful. I am wondering if it is possible
> to have a different strategy for “scaling in” to make it more conservative,
> or, more ambitiously, to allow custom autoscaling strategies (e.g. a
> time-based strategy).
>
> Another side thought is that to recover a job from a checkpoint/savepoint,
> the new parallelism cannot be larger than the max parallelism defined in the
> checkpoint (see
> https://github.com/apache/flink/blob/17a782c202c93343b8884cb52f4562f9c4ba593f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L128).
> Not sure if this limit should be mentioned in the FLIP.
>
> Again, thanks for the great work, and looking forward to using the Flink
> k8s operator with it!
>
> Best,
> Biao Geng
>
> From: Maximilian Michels <m...@apache.org>
> Date: Saturday, November 5, 2022 at 2:37 AM
> To: dev <dev@flink.apache.org>
> Cc: Gyula Fóra <gyula.f...@gmail.com>, Thomas Weise <t...@apache.org>,
> Marton Balassi <mbala...@apache.org>, Őrhidi Mátyás <
> matyas.orh...@gmail.com>
> Subject: [DISCUSS] FLIP-271: Autoscaling
> Hi,
>
> I would like to kick off the discussion on implementing autoscaling for
> Flink as part of the Flink Kubernetes operator. I've outlined an approach
> here which I find promising:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
>
> I've been discussing this approach with some of the operator contributors:
> Gyula, Marton, Matyas, and Thomas (all in CC). We started prototyping an
> implementation based on the current FLIP design. If that goes well, we
> would like to contribute this to Flink based on the results of the
> discussion here.
>
> I'm curious to hear your thoughts.
>
> -Max
>
