Re: TM heartbeat timeout due to ResourceManager being busy

Paul Lam Sun, 11 Oct 2020 23:46:51 -0700

Hi Xintong,

Thanks a lot for the pointer!


It’s good to see there would be a new IO executor to take care of the TM 
contexts. Looking forward to the 1.12 release!

Best,
Paul Lam

> On Oct 12, 2020, at 14:18, Xintong Song <tonysong...@gmail.com> wrote:
> 
> Hi Paul,
> 
> Thanks for reporting this.
> 
> Indeed, Flink's RM currently performs several HDFS operations in the rpc main 
> thread when preparing the TM context, which may block the main thread when 
> HDFS is slow.
> 
> Unfortunately, I don't see any out-of-the-box approach that fixes the problem at 
> the moment, except for increasing the heartbeat timeout.
> 
> As for the long-run solution, I think there's an easier approach. We can move 
> the creation of the TM contexts away from the rpc main thread. Ideally, we should 
> try to avoid performing any heavy operations which do not modify the RM's 
> internal states in the rpc main thread. With FLINK-19241, this can be 
> achieved easily by delegating the work to the io executor.
> 
> Thank you~
> Xintong Song
> 
> 
> On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
> 
> After FLINK-13184 is implemented (even with Flink 1.11), occasionally there
> would still be jobs with high parallelism getting TM-RM heartbeat timeouts
> when RM is busy creating TM contexts on cluster initialization and HDFS is
> slow at that moment.
> 
> Apart from increasing the TM heartbeat timeout, is there any recommended
> out of the box approach that can reduce the chance of getting the timeouts?
> 
> In the long run, is it possible to limit the number of taskmanager contexts
> that RM creates at a time, so that the heartbeat triggers can chime in?
> 
> Thanks!
> 
> Best,
> Paul Lam
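
For readers of the archive, the idea discussed in the thread (preparing TM contexts off the rpc main thread and returning to it only to update state) can be sketched in plain Java. This is only an illustration under stated assumptions, not Flink's actual code: the class TmContextOffloadSketch, the methods prepareTmContext and startWorkers, the 200 ms sleep standing in for slow HDFS calls, and the pool size of 4 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TmContextOffloadSketch {

    /** Hypothetical stand-in for the slow, HDFS-bound TM context preparation. */
    static String prepareTmContext(int workerId) {
        try {
            Thread.sleep(200); // simulate slow HDFS operations
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "context-" + workerId;
    }

    /** Prepares contexts on an IO pool; mutates state only on the "main thread". */
    static List<String> startWorkers(int count) throws Exception {
        // Single-threaded executor playing the role of the rpc main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Separate pool playing the role of the io executor (size is arbitrary).
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        // Internal state: only ever touched from tasks running on mainThread.
        List<String> started = new ArrayList<>();
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int workerId = i;
                pending.add(
                    CompletableFuture
                        // Heavy work that does not modify internal state: IO pool.
                        .supplyAsync(() -> prepareTmContext(workerId), ioExecutor)
                        // State update: hop back onto the main thread.
                        .thenAcceptAsync(started::add, mainThread));
            }
            // The main thread stays free meanwhile, so a task such as a
            // heartbeat handler queued here is not stuck behind the slow work.
            mainThread.submit(() -> System.out.println("heartbeat handled")).get();

            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
                .get(30, TimeUnit.SECONDS);
            // Take the snapshot on the main thread as well.
            Callable<List<String>> snapshot = () -> new ArrayList<>(started);
            return mainThread.submit(snapshot).get();
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("started: " + startWorkers(8));
    }
}
```

In this sketch only the cheap state update runs on the single-threaded executor, so the time it is blocked per worker no longer depends on how slow the simulated HDFS calls are. The order in which contexts complete is not deterministic.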
