Re: [Question] How to scale application based on 'reactive' mode

Chen Zhanghao Mon, 04 Sep 2023 23:59:38 -0700

Hi Dennis,

1. In Flink 1.18 + non-reactive mode, the autoscaler adjusts the job's
parallelism. The job requests extra TMs if the current ones cannot satisfy its
needs, and redundant TMs are released automatically later once they are idle.
In other words, parallelism changes cause the number of TMs to change.
2. The core metric used is busy time (the amount of time spent on task
processing per second = 1 s - backpressured time - idle time). It is
considered superior because it also takes I/O cost etc. into account. Also,
the metric has per-task granularity, which allows us to identify bottleneck
tasks.
3. The autoscaler feature currently only works with the K8s operator + native
K8s mode.
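
To illustrate point 2, here is a minimal Python sketch of the busy-time formula and of picking the bottleneck task. It is not the operator's implementation: the `TaskSample` type, its field names, and the sample values are hypothetical, and times are expressed in milliseconds per second.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """Hypothetical per-task sample; field names are illustrative only."""
    name: str
    backpressured_ms_per_s: float
    idle_ms_per_s: float


def busy_ms_per_second(sample: TaskSample) -> float:
    """Busy time = 1 s - backpressured time - idle time, in ms per second."""
    busy = 1000.0 - sample.backpressured_ms_per_s - sample.idle_ms_per_s
    # Clamp to [0, 1000] in case the reported values are slightly inconsistent.
    return max(0.0, min(1000.0, busy))


def bottleneck(samples: list[TaskSample]) -> TaskSample:
    """Return the task with the highest busy time."""
    return max(samples, key=busy_ms_per_second)


if __name__ == "__main__":
    # Made-up values for a three-task pipeline.
    samples = [
        TaskSample("source", backpressured_ms_per_s=600.0, idle_ms_per_s=300.0),
        TaskSample("window", backpressured_ms_per_s=50.0, idle_ms_per_s=100.0),
        TaskSample("sink", backpressured_ms_per_s=0.0, idle_ms_per_s=900.0),
    ]
    for s in samples:
        print(s.name, busy_ms_per_second(s))
    print("bottleneck:", bottleneck(samples).name)
```

In this made-up pipeline the source is mostly backpressured and the sink is mostly idle, so the task in between has the highest busy time and is the one to scale.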

Best,
Zhanghao Chen
________________________________
From: Dennis Jung <inylov...@gmail.com>
Sent: 2 September 2023 12:58
To: Gyula Fóra <gyula.f...@gmail.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: [Question] How to scale application based on 'reactive' mode

Hello,
Thanks for your notice.

1. In "Flink 1.18 + non-reactive", is parallelism changed by the number of
TMs?
2. The
document (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.6/docs/custom-resource/autoscaler/)
says "we are not using any container memory / CPU utilization metrics
directly here". Which metrics does it use internally?
3. I'm using standalone
k8s (https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/)
for deployment. Is the autoscaler feature only available when using the "flink
k8s operator" (sorry, I don't understand this clearly yet...)?

Regards

On Fri, 1 Sep 2023 at 22:20, Gyula Fóra
<gyula.f...@gmail.com> wrote:
Pretty much, except that with Flink 1.18 the autoscaler can scale the job in
place without restarting the JM (even without reactive mode).

So the best option is actually the autoscaler with Flink 1.18 in native mode
(no reactive).

Gyula

On Fri, 1 Sep 2023 at 13:54, Dennis Jung
<inylov...@gmail.com> wrote:
Thanks for the feedback.
Could you check whether I understand correctly?

Only using 'reactive' mode:
Manually adding a TaskManager (TM) (such as by using './bin/taskmanager.sh
start') increases parallelism. For example, when job parallelism is 1 with
1 TM, adding 1 new TM restarts the JobManager and parallelism becomes 2.
But the number of TMs is not controlled automatically.

Autoscaler + non-reactive:
It can flexibly control the number of TMs based on several metrics (CPU usage,
throughput, ...), and the JobManager will be restarted when scaling. But job
parallelism stays the same after the number of TMs has changed.

Autoscaler + 'reactive' mode:
It can control the number of TMs based on metrics, and increase/decrease job
parallelism by changing the number of TMs.
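
The 'reactive' example above (1 TM with parallelism 1, then 2 TMs with parallelism 2) can be sketched in Python as follows. This is only an illustration under the assumption that reactive mode uses every available slot, bounded by the job's max parallelism; the function and parameter names are hypothetical and not part of any Flink API.

```python
def reactive_parallelism(task_managers: int, slots_per_tm: int,
                         max_parallelism: int) -> int:
    """Parallelism a job would run at if it used every available slot.

    Assumption: total slots = task_managers * slots_per_tm, and the job
    never exceeds its max parallelism.
    """
    if task_managers < 0 or slots_per_tm < 1 or max_parallelism < 1:
        raise ValueError("invalid cluster or job settings")
    return min(task_managers * slots_per_tm, max_parallelism)


if __name__ == "__main__":
    # The example from the message: one slot per TM, going from 1 TM to 2 TMs.
    print(reactive_parallelism(1, slots_per_tm=1, max_parallelism=128))
    print(reactive_parallelism(2, slots_per_tm=1, max_parallelism=128))
```

Nothing in this sketch decides when a TM should be added; it only maps the resources that exist to a parallelism.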

Regards,
Jung

On Fri, 1 Sep 2023 at 20:16, Gyula Fóra
<gyula.f...@gmail.com> wrote:
I would look at reactive scaling as a way to increase / decrease parallelism.

It’s not a way to automatically decide when to actually do it, as you need to
create new TMs.

The autoscaler could use reactive mode to change the parallelism, but you need
the autoscaler itself to decide when new resources should be added.

On Fri, 1 Sep 2023 at 13:09, Dennis Jung
<inylov...@gmail.com> wrote:
For now, the thing I've found about 'reactive' mode is that it automatically
adjusts 'job parallelism' when TaskManager is increased/decreased.

https://www.slideshare.net/FlinkForward/autoscaling-flink-with-reactive-mode

Is there some other feature that only 'reactive' mode offers for scaling?

Thanks.
Regards.

On Fri, 1 Sep 2023 at 16:56, Dennis Jung
<inylov...@gmail.com> wrote:
Hello,
Thank you for your response. I have a few more questions about the following:
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/

Reactive Mode configures a job so that it always uses all resources available
in the cluster. Adding a TaskManager will scale up your job, removing resources
will scale it down. Flink will manage the parallelism of the job, always
setting it to the highest possible values.
=> Does this mean that when I add/remove a TaskManager in 'non-reactive' mode,
the resources (CPU/memory/etc.) used by the job do not change?

Reactive Mode restarts a job on a rescaling event, restoring it from the latest
completed checkpoint. This means that there is no overhead of creating a
savepoint (which is needed for manually rescaling a job). Also, the amount of
data that is reprocessed after rescaling depends on the checkpointing interval,
and the restore time depends on the state size.
=> As far as I know, 'rescaling' also works in non-reactive mode, restoring
from a checkpoint. What is the difference when using 'reactive' here?

The Reactive Mode allows Flink users to implement a powerful autoscaling
mechanism, by having an external service monitor certain metrics, such as
consumer lag, aggregate CPU utilization, throughput or latency. As soon as
these metrics are above or below a certain threshold, additional TaskManagers
can be added or removed from the Flink cluster.
=> Why is this only possible in 'reactive' mode? Seems this is more related to
'autoscaler'. Are there some specific features/API which can control
TaskManager/Parallelism only in 'reactive' mode?
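
The external-service idea quoted above can be sketched as a threshold rule in Python. This is a hedged illustration, not an existing Flink component: the metric (consumer lag), the thresholds, and all names are hypothetical, and actually adding or removing TaskManagers is left to whatever manages the cluster.

```python
def desired_task_managers(current_tms: int, consumer_lag: float,
                          scale_up_lag: float, scale_down_lag: float,
                          min_tms: int = 1, max_tms: int = 10) -> int:
    """Threshold rule: add one TM when lag is high, remove one when it is low."""
    if scale_down_lag >= scale_up_lag:
        raise ValueError("scale_down_lag must be below scale_up_lag")
    if consumer_lag > scale_up_lag:
        target = current_tms + 1
    elif consumer_lag < scale_down_lag:
        target = current_tms - 1
    else:
        target = current_tms  # inside the dead band: leave the cluster alone
    # Never leave the allowed range of TaskManagers.
    return max(min_tms, min(max_tms, target))


if __name__ == "__main__":
    # Made-up lag values checked against made-up thresholds.
    for lag in (50_000.0, 5_000.0, 100.0):
        print(lag, desired_task_managers(3, lag, scale_up_lag=10_000.0,
                                         scale_down_lag=1_000.0))
```

The gap between the two thresholds keeps the rule from flapping between scale-up and scale-down when the metric hovers around a single value.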

Thank you.

On Fri, 1 Sep 2023 at 15:30, Gyula Fóra
<gyula.f...@gmail.com> wrote:
The reactive mode reacts to available resources. The autoscaler reacts to
changing load and processing capacity and adjusts resources.

Completely different concepts and applicability.
Most people want the autoscaler, but this is a recent feature and is specific
to the k8s operator at the moment.

Gyula

On Fri, 1 Sep 2023 at 04:50, Dennis Jung
<inylov...@gmail.com> wrote:
Hello,
Thanks for your notice.

Then what is the purpose of using 'reactive', if it doesn't do anything by
itself?
What is the difference if I use the autoscaler without 'reactive' mode?

Regards,
Jung

On Fri, 18 Aug 2023 at 19:51, Gyula Fóra
<gyula.f...@gmail.com> wrote:
Hi!

I think what you need is probably not the reactive mode but a proper
autoscaler. The reactive mode, as you say, doesn't do anything in itself; you
need to build a lot of logic around it.

Check this instead:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/

The Kubernetes Operator has a built-in autoscaler that can scale jobs based on
Kafka data rate / processing throughput. It also doesn't rely on the reactive
mode.

Cheers,
Gyula

On Fri, Aug 18, 2023 at 12:43 PM Dennis Jung
<inylov...@gmail.com> wrote:
Hello,
Sorry for the frequent questions. This is a question about 'reactive' mode.

1. As far as I understand, even though I've set up `scheduler-mode: reactive`,
it will not change parallelism automatically by itself based on CPU usage or
Kafka consumer rate. It needs an additional resource monitor (such as the
Horizontal Pod Autoscaler or similar). Is this correct?
2. Is it possible to create a custom resource monitor application? For
example, if I want to increase/decrease parallelism based on Kafka consumer
rate, do I need to call a specific API from outside to trigger rescaling?
3. If 2 is correct, what is the difference when using 'reactive' mode? As far
as I can tell, calling a specific API will rescale the job whether or not
'reactive' mode is used... (or does the API only work in this mode)?

Thanks.

Regards
