[ https://issues.apache.org/jira/browse/FLINK-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-10255:
-----------------------------------
    Labels: pull-request-available  (was: )

> Standby Dispatcher locks submitted JobGraphs
> --------------------------------------------
>
>                 Key: FLINK-10255
>                 URL: https://issues.apache.org/jira/browse/FLINK-10255
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> Currently, standby {{Dispatchers}} lock submitted {{JobGraphs}} that are
> added to the {{SubmittedJobGraphStore}} if HA mode is enabled. Locking the
> {{JobGraphs}} can prevent their cleanup, leaving the system in an
> inconsistent state.
> The problem is that we recover the newly added {{JobGraph}} in the
> {{SubmittedJobGraphListener#onAddedJobGraph}} callback, which is also called
> if we don't have the leadership. Recovering the {{JobGraph}} currently locks
> the {{JobGraph}}. If the {{Dispatcher}} is not the leader, we won't start
> that job after its recovery. However, we also don't release the
> {{JobGraph}}, leaving it locked.
> There are two possible solutions to the problem. Either we check whether we
> are the leader before recovering jobs, or we say that recovering jobs does
> not lock them, and we lock a job only if we can submit the recovered job.
> The latter approach has the advantage that it follows a code path quite
> similar to that of an initial job submission. Moreover, jobs are currently
> also recovered in other places. In all these places we would currently need
> to release the {{JobGraphs}} if we cannot submit the recovered {{JobGraph}}
> (e.g. {{Dispatcher#grantLeadership}}).
> An extension of the first solution could be to stop the
> {{SubmittedJobGraphStore}} while the {{Dispatcher}} is not the leader. Then
> we would have to make sure that no concurrent callback from the
> {{SubmittedJobGraphStore#SubmittedJobGraphListener}} can be executed after
> revoking leadership from the {{Dispatcher}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
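
The second proposed solution in the description above (recovering a job does not lock it; only a successful submission does) can be illustrated with a small sketch. This is a toy model under stated assumptions, not Flink code: `ToyJobGraphStore`, `ToyDispatcher` and all of their methods are hypothetical names invented for illustration, and the lock is modelled as an in-memory set of owner names instead of whatever the real {{SubmittedJobGraphStore}} uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy store; it only mimics the lock/cleanup interplay and is
// NOT Flink's SubmittedJobGraphStore.
class ToyJobGraphStore {
    private final Map<String, String> graphs = new HashMap<>();
    // jobId -> names of the dispatchers currently holding a lock on it
    private final Map<String, Set<String>> locks = new HashMap<>();

    void put(String jobId, String graph) {
        graphs.put(jobId, graph);
    }

    // Recovery is a plain read in this sketch: it does not take a lock.
    String recover(String jobId) {
        return graphs.get(jobId);
    }

    void lock(String jobId, String owner) {
        locks.computeIfAbsent(jobId, id -> new HashSet<>()).add(owner);
    }

    boolean isLocked(String jobId) {
        Set<String> owners = locks.get(jobId);
        return owners != null && !owners.isEmpty();
    }

    // Cleanup is refused while any dispatcher still holds a lock.
    boolean remove(String jobId) {
        if (isLocked(jobId)) {
            return false;
        }
        return graphs.remove(jobId) != null;
    }
}

// Hypothetical toy dispatcher; leadership is a fixed flag for simplicity.
class ToyDispatcher {
    private final String name;
    private final ToyJobGraphStore store;
    private final boolean leader;
    private final Set<String> runningJobs = new HashSet<>();

    ToyDispatcher(String name, ToyJobGraphStore store, boolean leader) {
        this.name = name;
        this.store = store;
        this.leader = leader;
    }

    // Stand-in for the onAddedJobGraph callback, which fires on every
    // dispatcher, including standbys.
    void onAddedJobGraph(String jobId) {
        String graph = store.recover(jobId);
        if (graph == null || !leader) {
            return; // nothing was locked, so there is nothing to release
        }
        store.lock(jobId, name); // lock only once the job is actually run
        runningJobs.add(jobId);
    }

    boolean isRunning(String jobId) {
        return runningJobs.contains(jobId);
    }
}

class StandbyDispatcherDemo {
    public static void main(String[] args) {
        ToyJobGraphStore store = new ToyJobGraphStore();
        store.put("job-1", "graph");
        ToyDispatcher standby = new ToyDispatcher("standby", store, false);
        standby.onAddedJobGraph("job-1");
        System.out.println("locked by standby: " + store.isLocked("job-1"));
        System.out.println("cleanup succeeded: " + store.remove("job-1"));
    }
}
```

In this model a standby never holds a lock, so no release step is needed on the non-leader path. The first proposed solution would instead check leadership before calling `recover`; with a locking `recover`, every call site that recovers but cannot submit would need an explicit release.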