Thanks Yang Wang for sharing your view on this. Please find my responses below.
# HA data format in the HA backend (e.g. ZK, K8s ConfigMap)

> We have already changed the HA data format after introducing the multiple
> component leader election in FLINK-24038. For K8s HA,
> the num of ConfigMaps reduced from 4 to 2. Since we only have one leader
> elector, the K8s APIServer load should also be reduced.
> Why do we still need to change the format again? This just prevents the
> LAST_STATE upgrade mode in Flink-Kubernetes-Operator
> when the Flink version changed, even though it is a simple job and state is
> compatible.

The intention behind this remark is that we could reduce the number of
redundant records (the JavaDoc of
ZooKeeperMultipleComponentLeaderElectionHaServices [1] visualizes the
redundancy quite well, since each of these connection_info records would
contain the very same information). Right now, we save the same
connection_info for each of the componentIds (e.g. resource_manager,
dispatcher, ...).

My rationale was that we only need to save the connection info once per
LeaderElectionDriver, i.e. per LeaderElectionService. It's an implementation
detail of the LeaderElectionService to know which components it owns.
Therefore, I suggested having a unique ID per LeaderElectionService instance
with a single connection_info that is shared by all the components registered
to that service. If we decide to have a separate LeaderElectionService for a
specific component (e.g. the resource manager) in the future, this would end
up in a separate ConfigMap in k8s or a separate ZNode in ZooKeeper. I added
these details to the FLIP [2]; that part, indeed, was quite poorly described
there initially.
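To make that a bit more concrete, here is a minimal sketch of the idea. All
names in it (SingleRecordLeaderElectionService, publishLeaderInformation, the
key layout, the placeholder address) are made up for illustration and are not
the actual Flink interfaces; it only shows that the backend ends up with one
connection_info per service instance, no matter how many components register:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustration only: a leader election service instance owns a single
 * connection_info record that is shared by all components registered to it,
 * instead of writing one record per componentId.
 */
final class SingleRecordLeaderElectionService {

    private final String serviceId; // unique per LeaderElectionService instance
    private final Map<String, String> haBackend; // stands in for the ConfigMap/ZNode data
    private final Set<String> components = ConcurrentHashMap.newKeySet();

    SingleRecordLeaderElectionService(String serviceId, Map<String, String> haBackend) {
        this.serviceId = serviceId;
        this.haBackend = haBackend;
    }

    // resource_manager, dispatcher, ... all register against the same service
    // and therefore share its single connection_info record.
    void register(String componentId) {
        components.add(componentId);
    }

    // Called once the service's single leader elector has acquired leadership.
    void publishLeaderInformation(String leaderAddress) {
        // One record keyed by the service's ID, not one per registered component.
        haBackend.put(serviceId + "/connection_info", leaderAddress);
    }

    public static void main(String[] args) {
        Map<String, String> backend = new ConcurrentHashMap<>();
        SingleRecordLeaderElectionService service =
                new SingleRecordLeaderElectionService("jobmanager-leader-election", backend);
        service.register("resource_manager");
        service.register("dispatcher");
        service.publishLeaderInformation("<leader-address>");
        // Prints a single entry, regardless of how many components registered.
        System.out.println(backend);
    }
}
```

A component that gets its own LeaderElectionService later on would simply be a
second instance with its own serviceId, which then maps to a separate
ConfigMap/ZNode.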
I don't understand how the leader election affects the LAST_STATE upgrade mode
in the Kubernetes Operator, though. We use a separate ConfigMap for the
checkpoint data [3]. Can you elaborate a little bit more on your concern?

[1] https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
[3] https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163

# LeaderContender#initiateLeaderElection

> I do not get your point why we need the *initiateLeaderElection* in
> *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> could be executed as soon as the registration.

That's the change I'm not really happy about; I'm still not sure whether I
found the best solution here. The problem is the way the components are
initialized. The initial plan was to call
LeaderElectionService.register(LeaderContender) from within the
LeaderContender's constructor, which returns the LeaderElection instance that
serves as the adapter through which the contender confirms its leadership.
The LeaderContender therefore has to own the LeaderElection instance to be
able to participate in the leader election handshake (i.e. grant leadership
-> confirm leadership).

With that setup, we wouldn't be able to grant leadership during the
LeaderElectionService.register(LeaderContender) call, because that would
require the LeaderContender to confirm the leadership, which is not possible
while the LeaderElection hasn't been created and set within the
LeaderContender yet. Therefore, we need some means to initialize the
LeaderElection before the LeaderElectionService starts granting leadership,
and that's what this method is for. I tried to illustrate the ordering issue
in the sketch below.

Best,
Matthias
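To spell out that ordering issue, here is a minimal, self-contained sketch.
The type and method names (LeaderElectionServiceSketch, ContenderSketch,
startLeaderElection, ...) are made up for the example and are not the actual
Flink interfaces; the point is only that granting leadership synchronously
inside register() would reach the contender before it has stored the
LeaderElection it needs for confirming the leadership, which is why a
separate step to kick off the actual leader election is needed:

```java
/** Illustrative stand-ins; not the real Flink interfaces. */
interface LeaderElection {
    void confirmLeadership();
}

interface Contender {
    void grantLeadership();
}

final class LeaderElectionServiceSketch {

    // Must NOT grant leadership in here: the contender has not stored the
    // returned LeaderElection yet, so it couldn't confirm the leadership.
    LeaderElection register(Contender contender) {
        return () -> System.out.println("leadership confirmed");
    }

    // Separate second step: leadership is only granted once the contender
    // holds its LeaderElection instance (roughly the gap that the proposed
    // initiateLeaderElection step is meant to close).
    void startLeaderElection(Contender contender) {
        contender.grantLeadership();
    }
}

final class ContenderSketch implements Contender {

    private final LeaderElection leaderElection;

    ContenderSketch(LeaderElectionServiceSketch service) {
        // If register() granted leadership synchronously, grantLeadership()
        // would be called before this assignment has happened and
        // confirmLeadership() would fail on a null leaderElection.
        this.leaderElection = service.register(this);
    }

    @Override
    public void grantLeadership() {
        leaderElection.confirmLeadership();
    }

    public static void main(String[] args) {
        LeaderElectionServiceSketch service = new LeaderElectionServiceSketch();
        ContenderSketch contender = new ContenderSketch(service);
        // Safe: the contender's leaderElection field is set by now.
        service.startLeaderElection(contender);
    }
}
```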