Re: Slow restart from savepoint with large broadcast state when increasing parallelism

Jun Qin Thu, 15 Dec 2022 13:33:43 -0800

Hi Ken,

Without knowing the details, the first thing I would suggest checking is whether you have hit a rate limit configured in your state storage (e.g., S3), so that further downloads were throttled. Checking your storage metrics or logs should help confirm whether this is the case.
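As a rough sanity check on the throttling theory, the aggregate read volume during restore can be estimated from the figures in the original message quoted below (62 MB of broadcast state, new parallelism of 400). This is only a back-of-the-envelope sketch: it assumes every sub-task downloads a full copy of the broadcast state, which has not been confirmed in this thread, and the throughput value is a hypothetical placeholder to be replaced with whatever your storage metrics actually show.

```python
# Back-of-the-envelope estimate of total data read from state storage on restore.
# ASSUMPTION: each sub-task downloads one full copy of the broadcast state.

def total_restore_megabytes(state_mb, parallelism):
    """Total megabytes read if every sub-task fetches the full broadcast state."""
    return state_mb * parallelism


def restore_seconds(total_mb, throughput_mb_per_s):
    """Time to read total_mb at a sustained aggregate throughput."""
    return total_mb / throughput_mb_per_s


if __name__ == "__main__":
    state_mb = 62        # broadcast state size from the original message
    parallelism = 400    # new parallelism from the original message
    assumed_throughput = 100  # MB/s, hypothetical; take the real value from storage metrics

    total = total_restore_megabytes(state_mb, parallelism)
    print(f"total read: {total} MB")
    print(f"time at {assumed_throughput} MB/s: {restore_seconds(total, assumed_throughput)} s")
```

If the observed restore time is far longer than such an estimate, that would be consistent with throttling or retries rather than raw data volume, though it does not prove either.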

In addition, on the TMs where the restore was slow, do you see anything suspicious in the logs, e.g., reconnection attempts?

Thanks

Jun

Sent from my phone

-------- Original message --------
From: Ken Krugler <kkrugler_li...@transpac.com>
Date: Wed, 14 Dec 2022 19:32
To: User <user@flink.apache.org>
Subject: Slow restart from savepoint with large broadcast state when increasing parallelism

Hi all,

I have a job with a large amount of broadcast state (62MB).

I took a savepoint when my workflow was running with parallelism 300.

I then restarted the workflow with parallelism 400.

The first 297 sub-tasks restored their broadcast state fairly quickly, but after that it slowed to a crawl (maybe 2 sub-tasks finished per minute).

After 10 minutes we killed the job, so I don’t know if it would have ultimately succeeded.

Is this expected? Seems like it could lead to a bad situation, where it would take an hour to restart the workflow.

Thanks,

— Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch
