According to the UI, it seems that

org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

was the cause of a pipeline restart. As for the TM, I think it is an artifact of the new job allocation regime, which exhausts all slots on one TM rather than distributing them equitably. Some TMs are therefore under more stress than they would be in a pure round-robin (RR) distribution. We may have to lower the number of slots on each TM to define a good upper bound.

You are correct, 50s is a pretty generous value.

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com> wrote:

> Hi,
>
> The first exception should only be logged at info level. It is expected to see
> this exception when a TaskManager unregisters from the ResourceManager.
>
> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1].
> The default timeout is 50s, which should be a generous value. It is probably a
> good idea to find out why the heartbeats cannot be answered by the TM.
>
> Best,
> Gary
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>
> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> 2 issues we are seeing on 1.5.1 on a streaming pipeline
>>
>> org.apache.flink.util.FlinkException: The assigned slot
>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>
>> and
>>
>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id
>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>
>> Not sure about the first, but how do we increase the heartbeat interval of
>> a TM?
>>
>> Thanks much
>>
>> Vishal
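
For reference, the settings discussed in this thread (the heartbeat interval and timeout, and the number of slots per TM) all live in flink-conf.yaml. The fragment below is only a sketch: the two heartbeat values are the documented defaults in milliseconds, and the slot count is a made-up number to illustrate lowering the per-TM upper bound, not a recommendation for this cluster.

```yaml
# flink-conf.yaml (illustrative sketch, not a tuned configuration)

# How often heartbeats are requested, in milliseconds (documented default).
heartbeat.interval: 10000

# How long to wait for a heartbeat before declaring a timeout, in
# milliseconds (documented default, i.e. the 50s mentioned above).
heartbeat.timeout: 50000

# Slots offered by each TaskManager. The value 4 is a hypothetical example;
# a lower number caps how much work a single TM can be assigned.
taskmanager.numberOfTaskSlots: 4
```

Raising heartbeat.timeout would only mask the symptom, so it is probably still worth finding out why the TM could not answer heartbeats before changing it.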