Re: TM heartbeat timeout due to ResourceManager being busy

Xintong Song Sun, 11 Oct 2020 23:58:48 -0700

No worries :)


Thank you~

Xintong Song



On Mon, Oct 12, 2020 at 2:48 PM Paul Lam <paullin3...@gmail.com> wrote:

> Sorry for the misspelled name, Xintong
>
> Best,
> Paul Lam
>
> On Oct 12, 2020, at 14:46, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi Xingtong,
>
> Thanks a lot for the pointer!
>
> It’s good to see that there will be a new IO executor to take care of the TM
> contexts. Looking forward to the 1.12 release!
>
> Best,
> Paul Lam
>
> On Oct 12, 2020, at 14:18, Xintong Song <tonysong...@gmail.com> wrote:
>
> Hi Paul,
>
> Thanks for reporting this.
>
> Indeed, Flink's RM currently performs several HDFS operations in the rpc
> main thread when preparing the TM context, which may block the main thread
> when HDFS is slow.
>
> Unfortunately, I don't see any out-of-the-box approach that fixes the problem
> at the moment, except for increasing the heartbeat timeout.
>
> As for the long-run solution, I think there's an easier approach. We can
> move the creation of the TM contexts away from the rpc main thread. Ideally, we
> should avoid performing any heavy operations that do not modify the
> RM's internal state in the rpc main thread. With FLINK-19241, this can be
> achieved easily by delegating the work to the io executor.
>
> Thank you~
> Xintong Song
>
>
>
> On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi,
>>
>> After FLINK-13184 was implemented (even with Flink 1.11), jobs with high
>> parallelism still occasionally get TM-RM heartbeat timeouts when the RM is
>> busy creating TM contexts on cluster initialization and HDFS is slow at
>> that moment.
>>
>> Apart from increasing the TM heartbeat timeout, is there any recommended
>> out-of-the-box approach that can reduce the chance of getting the timeouts?
>>
>> In the long run, is it possible to limit the number of TaskManager contexts
>> that the RM creates at a time, so that the heartbeat triggers can chime in?
>>
>> Thanks!
>>
>> Best,
>> Paul Lam
>>
>
>
>
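
For readers of the archive, the long-run idea in the quoted reply (keep blocking, state-free work off the rpc main thread and return to it only to update state) can be sketched roughly as below. This is a minimal illustration using plain java.util.concurrent, not Flink's actual ResourceManager code: the class RmOffloadSketch, its methods, and the simulated 50 ms delay are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names come from Flink.
class RmOffloadSketch {
    // Stand-in for the single rpc main thread that owns all internal state.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // Stand-in for an io executor that is allowed to block.
    private final ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
    // Internal state: only read or written from mainThread.
    private final List<String> startedWorkers = new ArrayList<>();

    // Simulates a slow step that touches no internal state,
    // e.g. blocking file-system calls while preparing a TM context.
    private static String prepareContext(String workerId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "context-for-" + workerId;
    }

    // Heavy preparation runs on the io executor; only the cheap
    // state update hops back onto the main thread.
    CompletableFuture<Void> requestWorker(String workerId) {
        return CompletableFuture
                .supplyAsync(() -> prepareContext(workerId), ioExecutor)
                .thenAcceptAsync(ctx -> startedWorkers.add(workerId + ":" + ctx), mainThread);
    }

    // Stand-in for heartbeat handling: a short task on the main thread
    // that is no longer queued behind the slow preparation work.
    CompletableFuture<Integer> heartbeat() {
        return CompletableFuture.supplyAsync(startedWorkers::size, mainThread);
    }

    void shutdown() {
        ioExecutor.shutdown();
        mainThread.shutdown();
    }
}
```

In this sketch the main thread only runs short tasks, so a heartbeat submitted while many contexts are being prepared is not stuck behind the slow calls. If prepareContext were called directly on mainThread, each request would hold that thread for the full duration of the slow call.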
