Hi Ken,

Without knowing the details, the first thing I would suggest checking is whether you have hit a rate limit configured in your state storage (e.g., S3), which would cause further downloads to be throttled. Your storage metrics or logs should help confirm whether this is the case.

In addition, on the TMs where the restart was slow, do you see anything suspicious in the logs, e.g., reconnection attempts?
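
If it helps, here is a rough Python sketch for scanning a TM log for hints of throttling or reconnects. The keyword patterns and the function names are my own assumptions, not anything defined by Flink or your storage client, so please adjust them to what your setup actually logs.

```python
import re
from collections import Counter

# Assumed keywords only; replace with the messages your storage
# client and Flink version actually write to the TaskManager logs.
PATTERNS = {
    "throttling": re.compile(
        r"slow ?down|throttl|\b503\b|too many requests", re.IGNORECASE
    ),
    "reconnect": re.compile(
        r"reconnect|connection reset|timed? ?out", re.IGNORECASE
    ),
}


def scan_log_lines(lines):
    """Count how many lines match each hint category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log_file(path):
    """Scan one log file; the path is whatever your deployment uses."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return scan_log_lines(f)
```

Running scan_log_file on the logs of a slow TM and of a fast TM, and comparing the counts, should show whether the slow ones are seeing something different.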

Thanks
Jun




Sent from my phone


-------- Original message --------
From: Ken Krugler <kkrugler_li...@transpac.com>
Date: Wed, Dec 14, 2022, 19:32
To: User <user@flink.apache.org>
Subject: Slow restart from savepoint with large broadcast state when increasing parallelism
Hi all,

I have a job with a large amount of broadcast state (62MB).

I took a savepoint when my workflow was running with parallelism 300.

I then restarted the workflow with parallelism 400.

The first 297 sub-tasks restored their broadcast state fairly quickly, but after that it slowed to a crawl (maybe 2 sub-tasks finished per minute).

After 10 minutes we killed the job, so I don’t know if it would have ultimately succeeded.

Is this expected? Seems like it could lead to a bad situation, where it would take an hour to restart the workflow.

Thanks,

— Ken

--------------------------
Ken Krugler
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch
