Hi Chen,

Thanks for the reply. Yes, we already use a fairly low checkpoint interval.
But our jobs usually consume from 10-20 independent Kafka topics, and when we
rescale we stop reading from all of them. Do you think we could implement a
partial failover that only recalculates the Kafka partition assignments? Also,
rather than sending an interrupt signal, is there a better way to stop a Task
after it has finished its in-flight work, to prevent data duplication?
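To make the first question concrete, here is a toy sketch (not Flink's actual
assignment algorithm, just an illustration) of why recalculating partition
assignments on rescale needs coordination: with a plain modulo scheme, most
partitions change owner when the parallelism changes, so subtasks cannot
simply keep reading where they left off.

```python
def assign_partitions(num_partitions, parallelism):
    """Map each source subtask index to the partitions it reads (toy modulo scheme)."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment

def owner(assignment, partition):
    """Return the subtask index that owns a given partition."""
    return next(s for s, ps in assignment.items() if partition in ps)

before = assign_partitions(12, 3)  # 12 partitions, parallelism 3
after = assign_partitions(12, 4)   # rescaled to parallelism 4
moved = [p for p in range(12) if owner(before, p) != owner(after, p)]
print(moved)  # 9 of the 12 partitions change owner
```

So even for a stateless job, a partial failover would still have to pause and
reassign most source subtasks, which matches the coordination concern you
raised.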



On Sun, Jul 23, 2023 at 10:44 PM Chen Zhanghao <zhanghao.c...@outlook.com>
wrote:

> Hi Talat,
>
> In reactive mode, rescaling is performed via a whole-graph failover, which
> is already less costly than a full job restart, where all containers
> need to be requested again. For simple stateless jobs this usually won't
> take long (a few seconds); you can measure how long it takes for all tasks
> to return to RUNNING status during a rescaling. The fluctuating traffic might
> be caused by re-consuming some data when recovering from the previous
> checkpoint. In that case, reducing the checkpoint interval will help.
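>
> A minimal sketch of that suggestion, assuming the standard flink-conf.yaml
> keys (the interval values are illustrative, pick what fits your workload):
>
> ```yaml
> # A shorter interval means less data is re-consumed after a restore
> execution.checkpointing.interval: 10s
> execution.checkpointing.min-pause: 5s
> ```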
>
> As for partial failover during rescaling, it might be challenging.
> Rescaling a stateless job still involves redistributing Kafka
> partitions (for Kafka sources, for example) and requires some coordination
> work.
>
> Best,
> Zhanghao Chen
> ------------------------------
> *From:* Talat Uyarer via dev <dev@flink.apache.org>
> *Sent:* July 23, 2023 15:28
> *To:* dev <dev@flink.apache.org>
> *Subject:* Scaling Flink Jobs without Restarting Job
>
> Hi,
>
> We are using Flink with the Adaptive Scheduler (Reactive Mode) on Kubernetes,
> deployed in standalone application mode, for our streaming
> infrastructure. Our autoscaler scales our jobs up and down. However,
> each scale action causes a full job restart.
>
> Our customers complain about the fluctuating traffic we are sending. Is
> there any way to reschedule tasks and recalculate the execution graph without
> restarting the whole job? Or to reduce the restart time?
>
> The job's max parallelism is set to 2x maxWorker, and we use GCS for
> checkpoint storage. I know that rescaling stateful jobs requires key groups
> to be redistributed, but we also have stateless jobs, such as reading from
> Kafka, extracting data, and writing to a sink. If you can provide some entry
> points, we can start implementing support for those jobs.
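> For reference, the setup described above roughly corresponds to a
> configuration like this (the values are illustrative, not our real ones):
>
> ```yaml
> scheduler-mode: reactive                  # adaptive scheduler, reactive mode
> pipeline.max-parallelism: 128             # e.g. 2x the expected max workers
> state.checkpoints.dir: gs://some-bucket/checkpoints  # GCS checkpoint storage
> ```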
>
> Thanks
>
