Hi Juho,

It seems in your case the JobMaster did not receive a heartbeat from the TaskManager in time [1]. Heartbeat requests and answers are sent over the RPC framework, and the RPCs of one component (e.g., TaskManager, JobMaster, etc.) are dispatched by a single thread. Therefore, the reasons for heartbeat timeouts include:

1. The RPC threads of the TM or JM are blocked. In this case heartbeat requests or answers cannot be dispatched.
2. The scheduled task for sending the heartbeat requests [2] died.
3. The network is flaky.

If you are confident that the network is not the culprit, I would suggest setting the logging level to DEBUG and looking for periodic log messages (in both the JM and TM logs) that are related to heartbeating. If the periodic log messages are overdue, it is a hint that the main thread of the RPC endpoint is blocked somewhere.

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
[2] https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64

On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.au...@rovio.com> wrote:

> I also have jobs failing on a daily basis with the error "Heartbeat of
> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>
> Could anyone suggest how to debug possible causes?
>
> I already set these in flink-conf.yaml, but I'm still getting failures:
> heartbeat.interval: 10000
> heartbeat.timeout: 100000
>
> Thanks.
>
> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> According to the UI it seems that "
>>
>> org.apache.flink.util.FlinkException: The assigned slot
>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>
>> " was the cause of a pipe restart.
>>
>> As to the TM it is an artifact of the new job allocation regime which
>> will exhaust all slots on a TM rather than distributing them equitably.
>> TMs selectively are under more stress than in a pure RR distribution I
>> think. We may have to lower the slots on each TM to define a good upper
>> bound. You are correct 50s is a pretty generous value.
>>
>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> The first exception should be only logged on info level. It's expected
>>> to see this exception when a TaskManager unregisters from the
>>> ResourceManager.
>>>
>>> Heartbeats can be configured via heartbeat.interval and
>>> heartbeat.timeout [1]. The default timeout is 50s, which should be a
>>> generous value. It is probably a good idea to find out why the
>>> heartbeats cannot be answered by the TM.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>
>>>
>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
>>>> 2 issues we are seeing on 1.5.1 on a streaming pipe line
>>>>
>>>> org.apache.flink.util.FlinkException: The assigned slot
>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>
>>>>
>>>> and
>>>>
>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id
>>>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>
>>>>
>>>> Not sure about the first but how do we increase the heartbeat interval
>>>> of a TM
>>>>
>>>> Thanks much
>>>>
>>>> Vishal
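
One way to check whether the periodic heartbeat log messages are overdue is to scan a JM or TM log for lines that mention heartbeats and report unusually long pauses between them. The script below is only a sketch under assumptions: it assumes each log line starts with a timestamp like "2018-08-13 09:52:01,123" (adjust the pattern if your log4j layout differs), and it matches any line containing the word "heartbeat" rather than a specific Flink message. The function name find_heartbeat_gaps and its parameters are made up for illustration.

```python
import re
from datetime import datetime

# Assumed timestamp prefix, e.g. "2018-08-13 09:52:01,123".
# Adjust this if your log4j pattern formats the date differently.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_heartbeat_gaps(lines, max_gap_s=20.0, keyword="heartbeat"):
    """Return (previous_ts, current_ts, gap_seconds) for every pair of
    consecutive heartbeat-related log lines that are more than
    max_gap_s seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        # Keep only lines that mention the keyword (case-insensitive).
        if keyword not in line.lower():
            continue
        m = TS_RE.match(line)
        if not m:
            # Skip lines without a leading timestamp (e.g. stack traces).
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > max_gap_s:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

With heartbeat.interval set to 10000 (ms), heartbeat-related messages should show up roughly every 10 seconds, so a threshold of about twice that (the default of 20 seconds here) is a reasonable starting point. Pass the function the lines of a log file, for example find_heartbeat_gaps(open(path)), where path points at your own JM or TM log. A reported gap points at a time window in which the RPC main thread may have been blocked; it does not by itself tell you why.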