[ https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379396#comment-15379396 ]
ASF GitHub Bot commented on FLINK-4152:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2257

    [FLINK-4152] Allow re-registration of TMs at resource manager

    The `YarnFlinkResourceManager` does not allow the `JobManager` to re-register task managers which had been registered at the resource manager before. As a consequence of this refusal, the job manager rejects the registrations of these task managers. Such a scenario can happen if a `JobManager` loses leadership in an HA setting after it has registered some TMs. The old behaviour was that the resource manager cleared its list of registered workers and only accepted new registrations of task managers which had been started in a fresh container. However, if the previously registered TMs did not die, they try to reconnect to the new leader. The new leader then asks the resource manager whether the TMs represent valid resources. Since the resource manager has forgotten about the already started containers, it rejects the TMs.

    This PR changes the behaviour of the resource manager such that it can no longer reject TMs. Instead of being asked, it is simply informed by the JM about the registered TMs. If a TM happens to be running in a container which was started by the RM, the RM monitors this container. If the container dies, the RM notifies the JM about the death of the TM. In that sense, the RM no longer has the authority to interfere with JM-TM interactions; it is simply used as an additional monitoring service to detect TMs that died because their container failed.

    Furthermore, the PR adds a de-duplication mechanism to filter out concurrent registration runs on the task manager. Before, a `RefusedRegistration` could trigger a new registration run without cancelling the old one. This could lead to a massive number of registration messages if the TaskManager's registration was refused multiple times. The mechanism to de-duplicate `TriggerTaskManagerRegistration` messages works by assigning a registration run id which changes whenever a new registration run is started. `TriggerTaskManagerRegistration` messages with an outdated registration run id are then filtered out.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink FLINK-4152_YarnResourceManager

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2257

----
commit 2a29fa4df98fd63a7f27e1607152cd6e54d25ad1
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-07-15T08:51:59Z

    Add YarnFlinkResourceManager test to reaccept task manager registrations from a re-elected job manager

commit 6462d4750d79512fc93bbc60ca754c99142d1794
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-07-15T09:50:35Z

    Remove unnecessary sync logic between JobManager and ResourceManager

commit 55ce0c01783d9c4927a9f4677309c805e2b624e7
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-07-15T10:12:12Z

    Avoid duplicate reigstration attempts in case of a refused registration

commit aeb6ae5435dd88c640dc314b9ec31357815d080c
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-07-15T13:04:54Z

    Add test case to check that not an excessive amount of RegisterTaskManager messages are sent

----

> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>
>                 Key: FLINK-4152
>                 URL: https://issues.apache.org/jira/browse/FLINK-4152
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, TaskManager, YARN Client
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>         Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many messages when registering at the JobManager.
> This is the log file: https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that this is the expected behavior.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
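
For readers following the pull request description above, the registration run id idea can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the class `RegistrationRunTracker` and its methods are hypothetical names, and using a random `UUID` as the run id is an assumption made here for simplicity. Each new registration run gets a fresh id, and a trigger message carrying an older id is treated as stale and dropped.

```java
import java.util.UUID;

// Hypothetical helper, not part of Flink: tracks the id of the current registration run.
final class RegistrationRunTracker {
    // Assumption: a random UUID serves as the registration run id.
    private UUID currentRunId = UUID.randomUUID();

    // Called whenever a new registration run starts, e.g. after a refused registration.
    // Any trigger message still carrying the previous id becomes outdated.
    UUID startNewRun() {
        currentRunId = UUID.randomUUID();
        return currentRunId;
    }

    // A trigger message is only processed if it belongs to the current run.
    boolean isCurrent(UUID runId) {
        return currentRunId.equals(runId);
    }

    public static void main(String[] args) {
        RegistrationRunTracker tracker = new RegistrationRunTracker();
        UUID firstRun = tracker.startNewRun();
        // A refusal arrives and a second run is started without cancelling the first.
        UUID secondRun = tracker.startNewRun();
        System.out.println(tracker.isCurrent(firstRun));  // false: stale trigger is filtered out
        System.out.println(tracker.isCurrent(secondRun)); // true: current trigger is processed
    }
}
```

With such a check in place, retries scheduled by an older run stop generating further registration messages once a newer run has begun, which is the effect the description attributes to the de-duplication.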