Hi Talat,

In reactive mode, rescaling is performed via a whole-graph failover, which is already less costly than a full job restart, where all containers need to be requested again. For simple stateless jobs this usually does not take long (a few seconds); you can measure it by timing how long it takes for all tasks to return to the RUNNING state during a rescaling. The fluctuating traffic might be caused by re-consuming some data when recovering from the previous checkpoint; in that case, reducing the checkpoint interval will help.
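A minimal sketch, assuming a standard DataStream job, of tightening the checkpoint interval programmatically (the same can also be set via execution.checkpointing.interval in the Flink configuration); the 10 s / 5 s values are illustrative only, not recommendations:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointIntervalExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // A shorter interval means less data to re-consume after the
            // whole-graph failover triggered by a rescaling.
            env.enableCheckpointing(10_000L); // checkpoint every 10 s (illustrative)

            // Keep some breathing room so checkpoints do not pile up back to back.
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000L);

            // ... build and execute the pipeline as usual ...
        }
    }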
As regards partial failover for rescaling, it might be challenging: rescaling a stateless job still involves redistributing Kafka partitions (for Kafka sources, for example) and requires some coordination work. A minimal sketch of such a stateless pipeline is appended below the quoted message.

Best,
Zhanghao Chen

________________________________
From: Talat Uyarer via dev <dev@flink.apache.org>
Sent: July 23, 2023 15:28
To: dev <dev@flink.apache.org>
Subject: Scaling Flink Jobs without Restarting Job

Hi,

We are using Flink with the Adaptive Scheduler (Reactive Mode) on Kubernetes, in Standalone deployment Application mode, for our streaming infrastructure. Our autoscaler scales our jobs up or down; however, each scale action causes a job restart, and our customers complain about the fluctuating traffic we send. Is there any way to reschedule tasks and recompute the graph without restarting the whole job, or to reduce the restart time?

The job's max parallelism is set to 2x maxWorker, and we use GCS for checkpoint storage. I know that rescaling stateful jobs requires key groups to be redistributed, but we also have stateless jobs, such as reading from Kafka, extracting data, and writing to a sink. If you can provide some entry points, we can start implementing support for those jobs.

Thanks
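P.S. For illustration, a minimal sketch of the kind of stateless Kafka-to-sink job described above, assuming the flink-connector-kafka dependency; the broker address, topic, group id, and transformation are placeholders. The partition splits held by the source readers are what get redistributed when such a job is rescaled:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StatelessKafkaJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000L); // Kafka offsets are committed on checkpoints

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")   // placeholder
                    .setTopics("input-topic")            // placeholder
                    .setGroupId("stateless-demo")        // placeholder
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            // On a rescale, partition splits are reassigned across the new set of
            // source readers; no keyed state is involved for a job like this.
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
               .map(value -> value.trim())   // stand-in for the "extract" step
               .returns(Types.STRING)        // explicit type info for the lambda
               .print();                     // stand-in for the real sink

            env.execute("stateless-kafka-job");
        }
    }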