Hi Xintong,

Thanks a lot for the pointer!
It’s good to see that there will be a new IO executor to take care of the TM contexts. Looking forward to the 1.12 release!

Best,
Paul Lam

> On Oct 12, 2020, at 14:18, Xintong Song <tonysong...@gmail.com> wrote:
>
> Hi Paul,
>
> Thanks for reporting this.
>
> Indeed, Flink's RM currently performs several HDFS operations in the rpc main
> thread when preparing the TM context, which may block the main thread when
> HDFS is slow.
>
> Unfortunately, I don't see any out-of-the-box approach that fixes the problem
> at the moment, except for increasing the heartbeat timeout.
>
> As for the long-run solution, I think there's an easier approach. We can move
> the creation of the TM contexts away from the rpc main thread. Ideally, we
> should try to avoid performing any heavy operations which do not modify the
> RM's internal states in the rpc main thread. With FLINK-19241, this can be
> achieved easily by delegating the work to the io executor.
>
> Thank you~
> Xintong Song
>
>
> On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
>
> After FLINK-13184 is implemented (even with Flink 1.11), occasionally there
> would still be jobs with high parallelism getting TM-RM heartbeat timeouts
> when RM is busy creating TM contexts on cluster initialization and HDFS is
> slow at that moment.
>
> Apart from increasing the TM heartbeat timeout, is there any recommended
> out-of-the-box approach that can reduce the chance of getting the timeouts?
>
> In the long run, is it possible to limit the number of taskmanager contexts
> that RM creates at a time, so that the heartbeat triggers can chime in?
>
> Thanks!
>
> Best,
> Paul Lam
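
For reference, the long-run approach described in the thread (do the heavy, state-free work on an io executor, then return to the main thread only to update internal state) could look roughly like the sketch below. This is not Flink's code. The class and method names (TmContextSketch, createTmContext, startWorkers), the 50 ms sleep standing in for slow HDFS access, and the pool size of 4 are all assumptions made for illustration; a single-threaded executor plays the role of the rpc main thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TmContextSketch {

    /** Hypothetical stand-in for the blocking step (e.g. slow HDFS calls). */
    static String createTmContext(String containerId) {
        try {
            Thread.sleep(50); // simulated slow I/O, not a real HDFS call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "context-for-" + containerId;
    }

    /**
     * Builds each context on the io executor and records the result on the
     * single "main" thread, so the main thread never blocks on I/O and stays
     * free for other work such as heartbeats.
     */
    static List<String> startWorkers(List<String> containerIds) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        List<String> started = new ArrayList<>(); // mutated only on mainThread
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String id : containerIds) {
                pending.add(CompletableFuture
                        // heavy work that does not touch internal state
                        .supplyAsync(() -> createTmContext(id), ioExecutor)
                        // state update hops back to the main thread
                        .thenAcceptAsync(started::add, mainThread));
            }
            // Wait for every stage; results may arrive in any order.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            return new ArrayList<>(started);
        } finally {
            ioExecutor.shutdown();
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> contexts = startWorkers(Arrays.asList("c1", "c2", "c3"));
        System.out.println(contexts.size() + " contexts created");
    }
}
```

Because only the short state update runs on the main thread, a slow createTmContext call delays that one container rather than everything queued behind it.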