Re: Slow restart from savepoint with large broadcast state when increasing parallelism

Ken Krugler Fri, 16 Dec 2022 07:51:39 -0800

Hi Jun,

Thanks for following up.


The state storage is internal at a client, and isn’t throttled.

Also restoring from the savepoint when we didn’t change the parallelism was 
fine.

I didn’t see any errors in the TM logs, but I didn’t carefully inspect them - 
I’ll do that when we give this another test.

Broadcast state is weird in that it’s duplicated, apparently to avoid “hot spots” 
when restoring from state. So I’m wondering how Flink handles the case of 
restoring broadcast state when the parallelism increases.
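
For reference, the Flink documentation on the broadcast state pattern says that
on scale-up each task reads its own state, and the remaining (p_new - p_old)
tasks read the checkpoints of previous tasks in a round-robin manner. Below is
a rough sketch of what that would mean for a 300 -> 400 restart. The plain
modulo mapping and the function names are assumptions for illustration only;
this is not Flink's actual restore code.

```python
from collections import Counter


def source_subtask(new_index: int, old_parallelism: int) -> int:
    """Old subtask whose broadcast-state copy new subtask `new_index` reads.

    Assumption: round-robin is modeled as a simple modulo over the old
    subtask indices. Hypothetical helper, not a Flink API.
    """
    return new_index % old_parallelism


def read_counts(old_parallelism: int, new_parallelism: int) -> Counter:
    """Count how many new subtasks read each old subtask's copy."""
    return Counter(
        source_subtask(i, old_parallelism) for i in range(new_parallelism)
    )


counts = read_counts(300, 400)
# Old copies that would be read by more than one new subtask.
shared = sorted(old for old, n in counts.items() if n > 1)
print(len(shared), shared[0], shared[-1])
```

Under this assumed mapping, new subtasks 300 through 399 re-read the copies
written by old subtasks 0 through 99, so no single copy is read more than
twice. If that holds, the duplication by itself would not obviously explain a
slowdown confined to the last sub-tasks.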

Regards,

— Ken
 

> On Dec 15, 2022, at 4:33 PM, Jun Qin <qinjunje...@gmail.com> wrote:
> 
> Hi Ken,
> 
> Without knowing the details, the first thing I would suggest checking is 
> whether you have reached a threshold configured in your state storage 
> (e.g., S3), so that further downloads were throttled. Checking your 
> storage metrics or logs should help confirm whether this is the case.
> 
> In addition, on the TMs where the restart was slow, do you see anything 
> suspicious in the logs, e.g., reconnection attempts?
> 
> Thanks
> Jun
> 
> 
> 
> 
> Sent from my phone
> 
> 
> -------- Original message --------
> From: Ken Krugler <kkrugler_li...@transpac.com>
> Date: Wed, Dec 14, 2022, 19:32
> To: User <user@flink.apache.org>
> Subject: Slow restart from savepoint with large broadcast state when
> increasing parallelism
> Hi all,
> 
> I have a job with a large amount of broadcast state (62MB).
> 
> I took a savepoint when my workflow was running with parallelism 300.
> 
> I then restarted the workflow with parallelism 400.
> 
> The first 297 sub-tasks restored their broadcast state fairly quickly, but 
> after that it slowed to a crawl (maybe 2 sub-tasks finished per minute)
> 
> After 10 minutes we killed the job, so I don’t know if it would have 
> ultimately succeeded.
> 
> Is this expected? Seems like it could lead to a bad situation, where it would 
> take an hour to restart the workflow.
> 
> Thanks,
> 
> — Ken
> 
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com <http://www.scaleunlimited.com/>
> Custom big data solutions
> Flink, Pinot, Solr, Elasticsearch
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch
