zhijiang created FLINK-5893:
-------------------------------

             Summary: Race condition in removing previous JobManagerRegistration in ResourceManager
                 Key: FLINK-5893
                 URL: https://issues.apache.org/jira/browse/FLINK-5893
             Project: Flink
          Issue Type: Bug
          Components: ResourceManager
            Reporter: zhijiang

The map of {{JobManagerRegistration}} in {{ResourceManager}} is not thread-safe, and currently two threads may operate on it concurrently, which can produce unexpected results.

The scenario is as follows:

{{registerJobManager}}: When the job leader changes and the new JobManager leader registers with the ResourceManager, the new {{JobManagerRegistration}} replaces the old one in the map under the same {{JobID}} key. This is triggered by the RPC thread.

Meanwhile, the {{JobLeaderIdService}} in the ResourceManager may notice the job leader change and trigger {{jobLeaderLostLeadership}} in another thread. This action removes the previous {{JobManagerRegistration}} from the map by {{JobID}}, but by then the old {{JobManagerRegistration}} may already have been replaced by the new one from {{registerJobManager}}.

In summary, this race condition may cause the new {{JobManagerRegistration}} to be removed from the ResourceManager, resulting in an exception when a slot is later requested from the ResourceManager.

As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} on the RPC thread, so no extra lock is needed for the map.
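
To make the proposal concrete, here is a minimal sketch of the idea in plain Java. It is not Flink's actual code. {{ResourceManagerSketch}}, the {{String}} job and leader ids, and {{currentLeader}} are hypothetical stand-ins, and a single-threaded executor plays the role of the RPC thread behind {{runAsync}}. Because every map access is handed to that one thread, a plain {{HashMap}} needs no lock. Serializing the calls does not by itself decide their order, so the sketch also assumes the lost-leadership callback carries the old leader id and removes the entry only if that id is still the registered one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical, simplified stand-in for the ResourceManager (not Flink's real class).
class ResourceManagerSketch {
    // Plain HashMap: safe only because every access runs on mainThread.
    private final Map<String, String> jobManagerRegistrations = new HashMap<>();

    // Stand-in for the RPC thread; tasks run one at a time in submission order.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // Stand-in for runAsync: hand the action over to the RPC thread.
    void runAsync(Runnable action) {
        mainThread.execute(action);
    }

    // A new leader registers; replaces any previous registration for the job.
    void registerJobManager(String jobId, String leaderId) {
        runAsync(() -> jobManagerRegistrations.put(jobId, leaderId));
    }

    // May be called from any thread (e.g. the leader id service).
    // Assumption: the caller knows which leader lost leadership.
    void jobLeaderLostLeadership(String jobId, String oldLeaderId) {
        runAsync(() -> {
            String current = jobManagerRegistrations.get(jobId);
            // Remove only if the registration still belongs to the old leader.
            if (oldLeaderId.equals(current)) {
                jobManagerRegistrations.remove(jobId);
            }
        });
    }

    // Reads the registration on the RPC thread and waits for the answer.
    String currentLeader(String jobId) {
        CompletableFuture<String> result = new CompletableFuture<>();
        runAsync(() -> result.complete(jobManagerRegistrations.get(jobId)));
        return result.join();
    }

    void shutdown() {
        mainThread.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        ResourceManagerSketch rm = new ResourceManagerSketch();
        rm.registerJobManager("job-1", "leader-old");
        rm.registerJobManager("job-1", "leader-new");

        // A late notification about the old leader arrives from another thread.
        Thread leaderIdService =
                new Thread(() -> rm.jobLeaderLostLeadership("job-1", "leader-old"));
        leaderIdService.start();
        leaderIdService.join();

        System.out.println(rm.currentLeader("job-1")); // prints "leader-new"
        rm.shutdown();
    }
}
```

In this sketch the late notification for the old leader leaves the new registration in place, which is the outcome the issue asks for. The leader id comparison is an assumption of the sketch, not something stated in this report.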