Hi Talat,

In reactive mode, rescaling is performed by a whole-graph failover, which is
already cheaper than a full job restart, where all containers need to be
requested again. For simple stateless jobs, this usually doesn't take long (a
few seconds). You can measure how long it takes for all tasks to return to
RUNNING status during a rescaling. The fluctuating traffic might be caused by
re-consuming some data when recovering from a previous checkpoint. In that
case, reducing the checkpoint interval will help.
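
A minimal sketch of how the recovery time could be measured by polling the
JobManager REST endpoint GET /jobs/<job_id>. It assumes the response carries a
top-level "state" and a per-vertex "status"; the function names, the base URL
and the polling interval are placeholders of mine, not part of Flink.

```python
import json
import time
import urllib.request


def fetch_job(base_url, job_id):
    """GET /jobs/<job_id> from the JobManager REST endpoint and decode the JSON."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=5) as resp:
        return json.load(resp)


def all_tasks_running(job):
    """True when the job and every vertex report RUNNING (assumed field names)."""
    vertices = job.get("vertices", [])
    return (
        job.get("state") == "RUNNING"
        and bool(vertices)
        and all(v.get("status") == "RUNNING" for v in vertices)
    )


def measure_recovery_seconds(fetch, poll_interval=0.2, timeout=300.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Wait until the job leaves RUNNING, then time how long it takes to get back."""
    deadline = clock() + timeout
    left_running_at = None
    while clock() < deadline:
        try:
            running = all_tasks_running(fetch())
        except OSError:
            # The REST endpoint may be briefly unreachable; count it as not running.
            running = False
        now = clock()
        if left_running_at is None:
            if not running:
                left_running_at = now
        elif running:
            return now - left_running_at
        sleep(poll_interval)
    raise TimeoutError("job did not return to RUNNING within the timeout")
```

Start it just before triggering a scale action, for example with
measure_recovery_seconds(lambda: fetch_job("http://localhost:8081", job_id)),
where the URL and job_id are yours. The result is only accurate to within the
polling interval.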

As for partial failover for rescaling, it might be challenging. Rescaling a
stateless job still involves redistributing source splits (for example, Kafka
partitions for Kafka sources) and requires some coordination work.
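
To illustrate why a parallelism change touches most source subtasks, here is a
toy sketch. It assumes a simplified assignment of partition p to subtask
p % parallelism; this is an illustration only, not the exact logic of Flink's
Kafka connector, and the function names are hypothetical.

```python
def assign(partitions, parallelism):
    """Map each partition id to a subtask index (simplified: partition % parallelism)."""
    return {p: p % parallelism for p in range(partitions)}


def moved_partitions(partitions, old_parallelism, new_parallelism):
    """Return the partition ids whose owning subtask changes after rescaling."""
    before = assign(partitions, old_parallelism)
    after = assign(partitions, new_parallelism)
    return sorted(p for p in before if before[p] != after[p])
```

Under this assumption, moved_partitions(12, 4, 6) reports that 8 of the 12
partitions change owner, so existing subtasks would have to release partitions
and hand over their offsets rather than simply keep running.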

Best,
Zhanghao Chen
________________________________
From: Talat Uyarer via dev <dev@flink.apache.org>
Sent: July 23, 2023 15:28
To: dev <dev@flink.apache.org>
Subject: Scaling Flink Jobs without Restarting Job

Hi,

We are using Flink with the Adaptive Scheduler (Reactive Mode) on Kubernetes,
with a Standalone deployment in Application mode, for our streaming
infrastructure. Our autoscaler scales our jobs up and down. However,
each scale action causes a job restart.

Our customers complain about fluctuations in the traffic that we send. Is
there any way to reschedule tasks and recalculate the graph without restarting
the whole job? Or to reduce the restart time?

The job's max parallelism is set to 2x maxWorker, and we use GCS for
checkpoint storage. I know rescaling stateful jobs requires key groups to be
redistributed. But we also have stateless jobs, such as reading from Kafka,
extracting data, and writing to a sink. If you can provide some entry
points, we can start implementing support for those jobs.

Thanks
