[ https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285988#comment-14285988 ]
ASF GitHub Bot commented on FLINK-1352:
---------------------------------------

Github user hsaputra commented on the pull request:

    https://github.com/apache/flink/pull/328#issuecomment-70888379

    Hi @tillrohrmann, I don't think this PR changes the retry strategy, does it?

> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> The JobManager's InstanceManager may refuse the registration attempt from a
> TaskManager, because it has this TaskManager already connected, or, in the
> future, because the TaskManager has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refusal.
> The TaskManager reacts incorrectly to such messages, assuming successful
> registration.
> Possible solution: JobManager sends back a dedicated "RegistrationRefused"
> message if the instance manager returns null as the registration result. If
> the TaskManager receives that before being registered, it knows that the
> registration response was lost (which should not happen over TCP and would
> indicate a corrupt connection).
> Follow-up question: Does it make sense to have the TaskManager try
> indefinitely to connect to the JobManager, with an increasing interval (from
> seconds to minutes)?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
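
For illustration only, the proposal in the issue (an explicit refusal reply instead of a null instance ID, plus retries at an increasing interval) could be sketched roughly as below. This is not Flink's actual code. Only the name "RegistrationRefused" comes from the issue text; the types RegistrationReply, RegistrationBackoff and RegistrationClient, the doubling schedule, and the attempt limit are all assumptions made for the example.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical reply types. REFUSED stands in for the "RegistrationRefused"
// message proposed in the issue; the other names are assumptions.
enum RegistrationReply { ACKNOWLEDGED, REFUSED, NO_REPLY }

// Assumed backoff policy: the delay doubles per attempt and is capped,
// e.g. growing from seconds to minutes as the issue suggests.
final class RegistrationBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;

    RegistrationBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Delay to wait after the given failed attempt (0-based).
    long delayForAttempt(int attempt) {
        long delay = initialDelayMs;
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay = Math.min(maxDelayMs, delay * 2);
        }
        return delay;
    }
}

final class RegistrationClient {
    // Tries to register until a definite reply arrives or attempts run out.
    // A REFUSED reply is returned as-is and never treated as success.
    // The sleeper is injected so the sketch can run without real waiting.
    static RegistrationReply register(Supplier<RegistrationReply> sendRegistration,
                                      RegistrationBackoff backoff,
                                      int maxAttempts,
                                      LongConsumer sleeper) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            RegistrationReply reply = sendRegistration.get();
            if (reply != RegistrationReply.NO_REPLY) {
                return reply; // ACKNOWLEDGED or REFUSED: stop retrying
            }
            sleeper.accept(backoff.delayForAttempt(attempt));
        }
        return RegistrationReply.NO_REPLY;
    }
}
```

The point of the sketch is that the caller gets three distinguishable outcomes, so a refusal can no longer be mistaken for a successful registration. Whether retries should be bounded or indefinite is exactly the open follow-up question in the issue; the maxAttempts parameter here is just one possible answer.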