Thanks Matthias for updating the FLIP. Given that we do not touch the dedicated ConfigMap for checkpoints, this FLIP will not affect LAST_STATE recovery in the Flink Kubernetes Operator.
# LeaderContender#initiateLeaderElection

Could we simply execute *leaderElectionService.register(contender)* in the
same place as the current *leaderElectionService.start(contender)*, rather
than within the constructor? Then we could ensure that *onGrantLeadership*
does not happen before the contender is ready.

# LeaderElection#confirmLeadership

Could you please share more about how *LeaderElection#confirmLeadership*
works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*,
or will the *LeaderElectionService* be the adapter?

# LeaderRetrievalService

This might be out of the scope of this FLIP :) Do we need to introduce a
corresponding *MultipleComponentLeaderRetrievalService*, especially when we
are using only one ZNode/ConfigMap for storing all the leader information?
For Kubernetes HA, we already use a shared ConfigMap watcher for all the
leader retrieval services. However, for ZK HA, each leader retrieval
service still has a dedicated *TreeCache*.

Best,
Yang
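[For concreteness, a minimal Java sketch of the register/grant/confirm flow
these questions are about. Only the names used in this thread (register,
confirmLeadership, onGrant/RevokeLeadership, writeLeaderInformation) come
from the FLIP discussion; every signature and the ExampleContender class
are assumptions made up for illustration.]

    import java.util.UUID;

    interface LeaderElectionService {
        // Registration returns the contender-specific handle ("adapter").
        LeaderElection register(LeaderContender contender);
    }

    interface LeaderElection {
        // Whether this writes the leader information directly through
        // LeaderElectionDriver#writeLeaderInformation or delegates back to
        // the LeaderElectionService is exactly the confirmLeadership
        // question above.
        void confirmLeadership(UUID leaderSessionId, String leaderAddress);
    }

    interface LeaderContender {
        void onGrantLeadership(UUID leaderSessionId);
        void onRevokeLeadership();
    }

    // Yang's suggestion: call register() where start(contender) is called
    // today, i.e. only after the contender is fully constructed, so that
    // onGrantLeadership can never fire against a half-initialized contender.
    final class ExampleContender implements LeaderContender {

        private LeaderElection leaderElection;

        void start(LeaderElectionService service) {
            this.leaderElection = service.register(this);
        }

        @Override
        public void onGrantLeadership(UUID leaderSessionId) {
            // Grant -> confirm handshake.
            leaderElection.confirmLeadership(leaderSessionId, "<leader-address>");
        }

        @Override
        public void onRevokeLeadership() {
            // Stop leader-only services here.
        }
    }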
Matthias Pohl <matthias.p...@aiven.io.invalid> wrote on Thu, Jan 12, 2023 at 22:07:

> Thanks Yang Wang for sharing your view on this. Please find my responses
> below.
>
> > # HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> >
> > We have already changed the HA data format after introducing the
> > multiple component leader election in FLINK-24038. For K8s HA, the
> > number of ConfigMaps was reduced from 4 to 2. Since we only have one
> > leader elector, the K8s APIServer load should also be reduced. Why do
> > we still need to change the format again? This just prevents the
> > LAST_STATE upgrade mode in Flink-Kubernetes-Operator when the Flink
> > version changes, even though it is a simple job and the state is
> > compatible.
>
> The intention of this remark is that we could reduce the number of
> redundant records (the JavaDoc of
> ZooKeeperMultipleComponentLeaderElectionHaServices [1] visualizes the
> redundancy quite well, since each of these connection_info records
> contains the very same information). Right now, we're saving the same
> connection_info for each of the componentIds (e.g. resource_manager,
> dispatcher, ...). My rationale was that we only need to save the
> connection info once per LeaderElectionDriver, i.e. per
> LeaderElectionService. It's an implementation detail of the
> LeaderElectionService implementation to know which components it owns.
> Therefore, I suggested having a unique ID per LeaderElectionService
> instance with a single connection_info that is used by all the components
> registered to that service. If we decide to have a separate
> LeaderElectionService for a specific component (e.g. the resource
> manager) in the future, it would end up with a separate ConfigMap in k8s
> or a separate ZNode in ZooKeeper.
>
> I added these details to the FLIP [2]. That part, indeed, was quite
> poorly described there initially.
>
> I don't understand how the leader election affects the LAST_STATE changes
> in the Kubernetes Operator, though. We use a separate ConfigMap for the
> checkpoint data [3]. Can you elaborate a little bit more on your concern?
>
> [1]
> https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> [3]
> https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
>
> > # LeaderContender#initiateLeaderElection
> >
> > I do not get your point why we need *initiateLeaderElection* in
> > *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> > could be executed as soon as the registration happens.
>
> That's the change I'm not really happy about. I'm still not sure whether
> I found the best solution here. The problem is the way the components are
> initialized. The initial plan was to call
> LeaderElectionService.register(LeaderContender) from within the
> LeaderContender constructor, which would return the LeaderElection
> instance used as the adapter for the contender to confirm its leadership.
> The LeaderContender therefore has to own the LeaderElection instance to
> be able to participate in the leader election handshake (i.e. grant
> leadership -> confirm leadership). With that setup, we wouldn't be able
> to grant leadership during the
> LeaderElectionService.register(LeaderContender) call, because that would
> require the LeaderContender to confirm the leadership, which is not
> possible while the LeaderElection hasn't been created and set within the
> LeaderContender yet. Therefore, we need some means to initialize the
> LeaderElection before the LeaderElectionService triggers the granting of
> leadership. ...and that's what this method is for.
>
> Best,
> Matthias
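[To make the initialization-order problem above concrete: a hypothetical
sketch of a contender that registers from within its constructor, as in the
initial plan. It reuses the interfaces from the earlier sketch; all
signatures remain assumptions.]

    import java.util.UUID;

    // Registers from within the constructor, as in the initial plan.
    final class ConstructorRegisteringContender implements LeaderContender {

        private final LeaderElection leaderElection;

        ConstructorRegisteringContender(LeaderElectionService service) {
            // If the service granted leadership synchronously inside
            // register(), onGrantLeadership would run before this field is
            // assigned, and the confirmLeadership call below would hit a
            // null leaderElection.
            this.leaderElection = service.register(this);
        }

        @Override
        public void onGrantLeadership(UUID leaderSessionId) {
            // Needs this.leaderElection, so the grant has to be deferred
            // until after construction. Per the explanation above, letting
            // the LeaderElection be set before the LeaderElectionService
            // starts granting is what LeaderContender#initiateLeaderElection
            // is for; Yang's question suggests moving register() out of the
            // constructor instead (as in the first sketch).
            leaderElection.confirmLeadership(leaderSessionId, "<leader-address>");
        }

        @Override
        public void onRevokeLeadership() {}
    }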