According to the UI, it seems that

org.apache.flink.util.FlinkException: The assigned slot
208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

was the cause of a pipeline restart.

As for the TM, I think this is an artifact of the new job allocation regime,
which exhausts all slots on a TM rather than distributing them equitably. Some
TMs are therefore under more stress than they would be in a pure round-robin
(RR) distribution. We may have to lower the number of slots on each TM to
define a good upper bound. You are correct that 50s is a pretty generous value.
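
For reference, a rough sketch of the flink-conf.yaml entries involved. The
heartbeat values shown are the documented defaults (in milliseconds); the slot
count of 4 is purely an illustrative assumption, not a tested recommendation.

```yaml
# flink-conf.yaml (sketch only; the slot count is a hypothetical value)
heartbeat.interval: 10000          # ms between heartbeat requests (default)
heartbeat.timeout: 50000           # ms before a TM is declared lost (default, the 50s above)
taskmanager.numberOfTaskSlots: 4   # assumed lower value to cap the load on each TM
```

Raising heartbeat.timeout would only hide the symptom, so lowering the slot
count seems the more useful knob to try first.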

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com> wrote:

> Hi,
>
> The first exception should be logged only at info level. It is expected to
> see this exception when a TaskManager unregisters from the ResourceManager.
>
> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1].
> The default timeout is 50s, which should be a generous value. It is probably
> a good idea to find out why the heartbeats cannot be answered by the TM.
>
> Best,
> Gary
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>
>
> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> We are seeing 2 issues on 1.5.1 on a streaming pipeline:
>>
>> org.apache.flink.util.FlinkException: The assigned slot 
>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>
>>
>> and
>>
>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 
>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>
>>
>> Not sure about the first, but how do we increase the heartbeat interval of
>> a TM?
>>
>> Thanks much
>>
>> Vishal
>>
>
>
