Thanks, Gary.

What could be blocking the RPC threads? Slow checkpointing?

In production we're still running a self-built Flink 1.5-SNAPSHOT package
(commit 8395508b0401353ed07375e22882e7581d46ac0e), and the jobs are stable.

Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every
day. What changed between commit 8395508b0401353ed07375e22882e7581d46ac0e
and release 1.5.2?

Also, I just tried to run a slightly heavier job. It eventually had some
heartbeat timeouts, and then this:

2018-08-15 01:49:58,156 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) ->
XFilter -> EventMapFilter(AppFilters) (4/8)
(da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot
container_1534167926397_0001_01_000002_1 was removed.

After that the job tried to restart according to the configured Flink
restart strategy, but the restarts kept failing with this error:

2018-08-15 02:00:22,000 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job X
(19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 36, slots allocated: 12

This repeated until all restart attempts had been used up (we've set the
limit to 50), and then the job finally failed.

I would also like to know how to prevent Flink from getting into such a bad
state. At the very least it should fail immediately instead of retrying in
such a situation. And why was the assigned slot container removed in the
first place?
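
For what it's worth, here is a minimal sketch of what I mean by failing
fast, using the standard per-job restart strategy API (the attempt count
and delay below are placeholders, not recommendations):

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FailFastJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Fail the job immediately instead of retrying when the cluster
            // cannot provide the required slots.
            env.setRestartStrategy(RestartStrategies.noRestart());

            // Alternatively, bound the damage with a few delayed attempts:
            // env.setRestartStrategy(
            //         RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // ... build the actual pipeline and call env.execute() here ...
        }
    }

The same can also be set cluster-wide with the restart-strategy key in
flink-conf.yaml.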

On Tue, Aug 14, 2018 at 11:24 PM Gary Yao <g...@data-artisans.com> wrote:

> Hi Juho,
>
> It seems in your case the JobMaster did not receive a heartbeat from the
> TaskManager in time [1]. Heartbeat requests and answers are sent over the
> RPC framework, and the RPCs of one component (e.g., TaskManager, JobMaster)
> are dispatched by a single thread. Therefore, the reasons for heartbeat
> timeouts include:
>
>     1. The RPC threads of the TM or JM are blocked. In this case heartbeat
> requests or answers cannot be dispatched (see the sketch after this list).
>     2. The scheduled task for sending the heartbeat requests [2] died.
>     3. The network is flaky.
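>
> To make case 1 concrete, here is a minimal, purely illustrative Java sketch
> (not Flink code) of a single-threaded dispatcher in which one blocking call
> delays every scheduled heartbeat:
>
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     public class BlockedDispatcherDemo {
>         public static void main(String[] args) throws InterruptedException {
>             // One thread, like the main thread of an RPC endpoint.
>             ScheduledExecutorService rpcMainThread =
>                     Executors.newSingleThreadScheduledExecutor();
>
>             // Periodic "heartbeat" every second.
>             rpcMainThread.scheduleAtFixedRate(
>                     () -> System.out.println("heartbeat " + System.currentTimeMillis()),
>                     0, 1, TimeUnit.SECONDS);
>
>             // A long-running call hogs the only thread, so no heartbeats
>             // are printed for ten seconds.
>             rpcMainThread.execute(() -> {
>                 try {
>                     Thread.sleep(10_000);
>                 } catch (InterruptedException e) {
>                     Thread.currentThread().interrupt();
>                 }
>             });
>
>             Thread.sleep(15_000);
>             rpcMainThread.shutdownNow();
>         }
>     }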
>
> If you are confident that the network is not the culprit, I would suggest
> setting the logging level to DEBUG and looking for the periodic log
> messages related to heartbeating in the JM and TM logs. If those periodic
> messages are overdue, it is a hint that the main thread of the RPC endpoint
> is blocked somewhere.
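>
> Independent of logging, a thread dump (e.g., taken with the standard jstack
> tool) will show where that thread is stuck. As a purely illustrative
> sketch, the same information is available in-process via the JDK's
> ThreadMXBean:
>
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.ThreadInfo;
>
>     public class ThreadDump {
>         public static void main(String[] args) {
>             // Dump all live thread stacks; look for a blocked or waiting
>             // RPC dispatcher thread.
>             for (ThreadInfo info :
>                     ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
>                 System.out.print(info);
>             }
>         }
>     }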
>
> Best,
> Gary
>
> [1]
> https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
> [2]
> https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64
>
> On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.au...@rovio.com> wrote:
>
>> I also have jobs failing on a daily basis with the error "Heartbeat of
>> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>>
>> Could anyone suggest how to debug possible causes?
>>
>> I already set these in flink-conf.yaml, but I'm still getting failures:
>> heartbeat.interval: 10000
>> heartbeat.timeout: 100000
>>
>> Thanks.
>>
>> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> According to the UI it seems that "
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot 
>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>> " was the cause of a pipe restart.
>>>
>>> As to the TM, it is an artifact of the new job allocation regime, which
>>> will exhaust all slots on a TM rather than distributing them equitably.
>>> TMs are selectively under more stress than in a pure round-robin
>>> distribution, I think. We may have to lower the slots on each TM to
>>> define a good upper bound. You are correct, 50s is a pretty generous
>>> value.
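>>>
>>> For example, something like this in flink-conf.yaml (the value is just a
>>> placeholder, assuming we halve the slots per TM):
>>>
>>> taskmanager.numberOfTaskSlots: 4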
>>>
>>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The first exception should only be logged on INFO level. It's expected
>>>> to see this exception when a TaskManager unregisters from the
>>>> ResourceManager.
>>>>
>>>> Heartbeats can be configured via heartbeat.interval and
>>>> heartbeat.timeout [1].
>>>> The default timeout is 50s, which should be a generous value. It is
>>>> probably a
>>>> good idea to find out why the heartbeats cannot be answered by the TM.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>>
>>>>
>>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>>>> vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Two issues we are seeing on 1.5.1 on a streaming pipeline:
>>>>>
>>>>> org.apache.flink.util.FlinkException: The assigned slot 
>>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>>
>>>>>
>>>>> and
>>>>>
>>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 
>>>>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>>
>>>>>
>>>>> Not sure about the first, but how do we increase the heartbeat
>>>>> interval of a TM?
>>>>>
>>>>> Thanks much
>>>>>
>>>>> Vishal
>>>>>
>>>>
>>>>
>>>
>>
>
