Liu created FLINK-11215:
---------------------------
Summary: TaskExecutor RegistrationTimeoutException within the
specified maximum registration duration 300000ms
Key: FLINK-11215
URL: https://issues.apache.org/jira/browse/FLINK-11215
Project: Flink
Issue Type: Bug
Reporter: Liu
Attachments: image-2018-12-25-14-50-35-348.png
Sometimes a job fails after 5 minutes because registration at the resource
manager fails.
!https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!
In fact, however, the registration had succeeded 5 minutes earlier (the tag
ljg was added by me for testing).
!image-2018-12-25-14-50-35-348.png!
The problem occurs because the function startRegistrationTimeout in
TaskExecutor is executed in multiple places.
In the function start, it is executed asynchronously via
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()).
It is also executed at the end of the start function. The order of these two
executions is not guaranteed, but both change the same variable,
currentRegistrationTimeoutId. If the asynchronous path is fast enough to
execute startRegistrationTimeout() first, the job fails 5 minutes later
because of the startRegistrationTimeout execution at the end of the start
function.
The solution is to call startRegistrationTimeout at the beginning of the
start function. After this change, the problem has not appeared again.
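The ordering problem can be illustrated with a small single-threaded model.
This is only a sketch, not the TaskExecutor code: it assumes that a successful
registration cancels the timeout by clearing currentRegistrationTimeoutId, and
that an expired timer fails the job only if its id still matches that
variable. The class name, stopRegistrationTimeout, anyTimeoutFails and the two
scenario methods are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified, single-threaded model of the registration timeout bookkeeping.
// All names except startRegistrationTimeout and currentRegistrationTimeoutId
// are assumptions made for this sketch.
final class RegistrationTimeoutSketch {

    private UUID currentRegistrationTimeoutId;
    // Ids of every timer that was scheduled; each one "expires" later.
    private final List<UUID> scheduledTimeouts = new ArrayList<>();

    // Arms a new timeout and overwrites the shared id.
    void startRegistrationTimeout() {
        UUID id = UUID.randomUUID();
        currentRegistrationTimeoutId = id;
        scheduledTimeouts.add(id);
    }

    // Assumed behavior on successful registration: clear the shared id.
    void stopRegistrationTimeout() {
        currentRegistrationTimeoutId = null;
    }

    // After the maximum registration duration, a timer is fatal only if
    // its id is still the current one.
    boolean anyTimeoutFails() {
        for (UUID id : scheduledTimeouts) {
            if (id.equals(currentRegistrationTimeoutId)) {
                return true;
            }
        }
        return false;
    }

    // Interleaving from the report: the async path runs and registers
    // before start() reaches its final startRegistrationTimeout call.
    static boolean timeoutLastInStart() {
        RegistrationTimeoutSketch t = new RegistrationTimeoutSketch();
        t.startRegistrationTimeout(); // async leader listener
        t.stopRegistrationTimeout();  // registration succeeded
        t.startRegistrationTimeout(); // end of start(): never cleared
        return t.anyTimeoutFails();
    }

    // Proposed order: the timeout is armed before the async path can run.
    static boolean timeoutFirstInStart() {
        RegistrationTimeoutSketch t = new RegistrationTimeoutSketch();
        t.startRegistrationTimeout(); // beginning of start()
        t.startRegistrationTimeout(); // async leader listener
        t.stopRegistrationTimeout();  // registration succeeded
        return t.anyTimeoutFails();
    }

    public static void main(String[] args) {
        System.out.println("armed last in start():  job fails = " + timeoutLastInStart());
        System.out.println("armed first in start(): job fails = " + timeoutFirstInStart());
    }
}
```

Under these assumptions, the first scenario leaves a timer whose id is never
cleared, so it expires as a failure even though registration succeeded. In the
second scenario every timer is armed before the successful registration, so
none of them matches when it expires.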
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)