Thanks, Gary. What could be blocking the RPC threads? Slow checkpointing?
In production we're still using a self-built Flink package 1.5-SNAPSHOT (commit 8395508b0401353ed07375e22882e7581d46ac0e), and the jobs are stable. Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every day. What changed between that commit and the 1.5.2 release?

Also, I just tried to run a slightly heavier job. It eventually had some heartbeat timeouts, and then this:

2018-08-15 01:49:58,156 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) -> XFilter -> EventMapFilter(AppFilters) (4/8) (da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1534167926397_0001_01_000002_1 was removed.

After that the job tried to restart according to the configured restart strategy, but each attempt failed with this error:

2018-08-15 02:00:22,000 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job X (19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 36, slots allocated: 12

This repeated until all restart attempts had been used (we've set the limit to 50), and then the job finally failed.

I would also like to know how to prevent Flink from getting into such a bad state. At least it should exit immediately instead of retrying in such a situation. And why was the assigned slot container removed in the first place?

On Tue, Aug 14, 2018 at 11:24 PM Gary Yao <g...@data-artisans.com> wrote:

> Hi Juho,
>
> It seems in your case the JobMaster did not receive a heartbeat from the
> TaskManager in time [1]. Heartbeat requests and answers are sent over the
> RPC framework, and RPCs of one component (e.g., TaskManager, JobMaster,
> etc.) are dispatched by a single thread. Therefore, the reasons for
> heartbeat timeouts include:
>
> 1. The RPC threads of the TM or JM are blocked. In this case heartbeat
>    requests or answers cannot be dispatched.
> 2. The scheduled task for sending the heartbeat requests [2] died.
> 3. The network is flaky.
>
> If you are confident that the network is not the culprit, I would suggest
> setting the logging level to DEBUG and looking for periodic log messages
> (in both JM and TM logs) that are related to heartbeating. If the periodic
> log messages are overdue, it is a hint that the main thread of the RPC
> endpoint is blocked somewhere.
>
> Best,
> Gary
>
> [1] https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
> [2] https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64
>
> On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.au...@rovio.com> wrote:
>
>> I also have jobs failing on a daily basis with the error "Heartbeat of
>> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>>
>> Could anyone suggest how to debug possible causes?
>>
>> I already set these in flink-conf.yaml, but I'm still getting failures:
>>
>> heartbeat.interval: 10000
>> heartbeat.timeout: 100000
>>
>> Thanks.
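As a side note for anyone else debugging this: my reading of Gary's DEBUG suggestion is something along these lines in conf/log4j.properties. This is only a minimal sketch, assuming the log4j.properties that ships with the Flink distribution (and its "file" appender); scoping DEBUG to the org.apache.flink.runtime.heartbeat package (where the classes linked in [1] and [2] live) instead of the root logger is my own assumption to keep log volume down:

    # keep everything else at INFO
    log4j.rootLogger=INFO, file
    # assumed: enable DEBUG only for the heartbeat classes referenced above
    log4j.logger.org.apache.flink.runtime.heartbeat=DEBUG

If the periodic heartbeat messages then stop appearing, a thread dump of the affected JVM (e.g. jstack <TM pid>) should show whether its RPC dispatcher thread is stuck, and where.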
>>
>> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> According to the UI it seems that
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot
>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>> was the cause of a pipe restart.
>>>
>>> As to the TM, it is an artifact of the new slot allocation regime, which
>>> will exhaust all slots on a TM rather than distributing them equitably.
>>> Individual TMs are therefore under more stress than in a pure round-robin
>>> (RR) distribution, I think. We may have to lower the number of slots on
>>> each TM to define a good upper bound. You are correct, 50s is a pretty
>>> generous value.
>>>
>>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The first exception should only be logged on info level. It's expected to
>>>> see this exception when a TaskManager unregisters from the ResourceManager.
>>>>
>>>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout
>>>> [1]. The default timeout is 50s, which should be a generous value. It is
>>>> probably a good idea to find out why the heartbeats cannot be answered by
>>>> the TM.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>>
>>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> We are seeing two issues on a 1.5.1 streaming pipeline:
>>>>>
>>>>> org.apache.flink.util.FlinkException: The assigned slot
>>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>>
>>>>> and
>>>>>
>>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id
>>>>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>>
>>>>> Not sure about the first, but how do we increase the heartbeat interval
>>>>> of a TM?
>>>>>
>>>>> Thanks much,
>>>>>
>>>>> Vishal
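To make Gary's answer to Vishal's last question concrete: the heartbeat settings go into flink-conf.yaml, as in the sketch below. The heartbeat keys are the documented ones from the config page Gary linked; the restart-strategy keys match the 50-attempt limit I mentioned above, and all of the values here are only illustrative:

    # heartbeat settings, in milliseconds (the default timeout is 50s, as Gary noted)
    heartbeat.interval: 10000
    heartbeat.timeout: 100000
    # fixed-delay restart strategy, capped at 50 attempts
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 50
    restart-strategy.fixed-delay.delay: 10 s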