Thanks Matthias for the detailed explanation. For the HA backend data structure, you are right: even though the different components are running in the same JVM, they have completely different connection infos. But it is not urgent to use a single ZNode to store multiple connection entries for now.
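Just to be explicit about what I mean by "a single ZNode with multiple connection entries", a rough sketch is below. The class and method names are made up for illustration only and are not part of the FLIP.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical consolidated record: one ZNode/ConfigMap entry per
// LeaderElectionService, holding one (leader session id, rpc address) pair
// per contender ID instead of one entry per component.
final class ConsolidatedLeaderInformation {

    // e.g. "resource_manager" -> (leader session id, rpc address)
    private final Map<String, Map.Entry<UUID, String>> entriesPerContenderId = new HashMap<>();

    void put(String contenderId, UUID leaderSessionId, String rpcAddress) {
        entriesPerContenderId.put(contenderId, Map.entry(leaderSessionId, rpcAddress));
    }

    Optional<String> leaderAddressOf(String contenderId) {
        return Optional.ofNullable(entriesPerContenderId.get(contenderId))
                .map(Map.Entry::getValue);
    }
}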
I lean towards not introducing the *LeaderElection* interface since it does not bring many benefits. BTW, if *LeaderElectionService#register* (which returns the *LeaderElection*) and *LeaderElectionService#onGrantLeadership* are guarded by the same lock, then we can ensure that the leaderElection in the *LeaderContender* is always non-null when it tries to confirm the leadership. And then we would not need *LeaderContender#initiateLeaderElection*. Right?
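To make the locking idea concrete, here is a minimal, purely illustrative sketch. The types are simplified stand-ins invented for this mail (not the FLIP's or Flink's actual API); it only shows register() and the grant path sharing one lock.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified stand-ins for the interfaces discussed in this thread.
interface SketchContender {
    void grantLeadership(UUID leaderSessionId);
}

interface SketchLeaderElection {
    void confirmLeadership(UUID leaderSessionId, String leaderAddress);
}

final class SketchLeaderElectionService {

    private final Object lock = new Object();
    private final List<SketchContender> contenders = new ArrayList<>();

    // Registration and leadership grants are serialized by the same lock, so a
    // grant can never be delivered while a register(...) call is still running.
    SketchLeaderElection register(SketchContender contender) {
        synchronized (lock) {
            contenders.add(contender);
            return (sessionId, address) -> {
                // In a real service this would forward the confirmation to the
                // leader election driver / write the leader information.
            };
        }
    }

    // Invoked by the leader election driver once this service instance has the
    // leadership.
    void onGrantLeadership(UUID leaderSessionId) {
        synchronized (lock) {
            for (SketchContender contender : contenders) {
                contender.grantLeadership(leaderSessionId);
            }
        }
    }
}

The sketch only covers the serialization of registration and grants; revocation, error handling, and the driver wiring are left out.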
Best,
Yang

On Tue, Jan 17, 2023 at 8:31 PM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:

> Thanks Yang for getting back to me.
>
> I checked the connection information that's stored in the HA backend once
> more. My previous proposal was based on a wrong assumption: the address we
> store is the RPC endpoint's address. That address should be unique per
> component, which means that we shouldn't change how the HA backend stores
> the connection information. We don't store redundant information here. As
> a consequence, I had to reiterate over my proposal in FLIP-285. We're now
> relying on contender IDs that are going to be used internally for storing
> the connection information in the HA backend (analogously to how it's done
> in the MultipleComponentLeaderElectionService implementation right now).
>
> Additionally, I came to the conclusion that it's actually not really
> necessary to add the LeaderElection interface. Working with the contender
> IDs to identify the LeaderContender in the LeaderElectionService might be
> good enough. But I still kept the LeaderElection interface as an
> (optional) extension of FLIP-285 as it might improve testability a bit. I
> added some diagrams and descriptions to the FLIP hoping that this helps
> answer your questions, Yang.
>
> Best,
> Matthias
>
> On Mon, Jan 16, 2023 at 8:41 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
> > Thanks Matthias for updating the FLIP.
> >
> > Given that we do not touch the dedicated ConfigMap for the checkpoint,
> > this FLIP will not affect the LAST_STATE recovery for the Flink
> > Kubernetes Operator.
> >
> > # LeaderContender#initiateLeaderElection
> > Could we simply execute *leaderElectionService.register(contender)* in
> > the same place as the current *leaderElectionService.start(contender)*,
> > not within the constructor? Then we could ensure that
> > *onGrantLeadership* does not happen before the contender is ready.
> >
> > # LeaderElection#confirmLeadership
> > Could you please share more about how *LeaderElection#confirmLeadership*
> > works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*,
> > or will the *LeaderElectionService* be the adapter?
> >
> > # LeaderRetrievalService
> > This might be out of the scope of this FLIP :)
> > Do we need to introduce a corresponding
> > *MultipleComponentLeaderRetrievalService*, especially when we are using
> > only one ZNode/ConfigMap for storing all the leader information?
> > For Kubernetes HA, we already use a shared ConfigMap watcher for all the
> > leader retrieval services. However, for ZK HA, each leader retrieval
> > service still has a dedicated *TreeCache*.
> >
> > Best,
> > Yang
> >
> > On Thu, Jan 12, 2023 at 10:07 PM Matthias Pohl
> > <matthias.p...@aiven.io.invalid> wrote:
> >
> > > Thanks Yang Wang for sharing your view on this. Please find my
> > > responses below.
> > >
> > > > # HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> > > > We have already changed the HA data format after introducing the
> > > > multiple component leader election in FLINK-24038. For K8s HA, the
> > > > number of ConfigMaps was reduced from 4 to 2. Since we only have one
> > > > leader elector, the K8s APIServer load should also be reduced.
> > > > Why do we still need to change the format again? This just prevents
> > > > the LAST_STATE upgrade mode in Flink-Kubernetes-Operator when the
> > > > Flink version changes, even though it is a simple job and the state
> > > > is compatible.
> > >
> > > The intention of this remark is that we could reduce the number of
> > > redundant records (the
> > > ZooKeeperMultipleComponentLeaderElectionHaServices' JavaDoc [1]
> > > visualizes the redundancy quite well since each of these
> > > connection_info records would contain the very same information).
> > > We're saving the same connection_info for each of the componentIds
> > > (e.g. resource_manager, dispatcher, ...) right now. My rationale was
> > > that we only need to save the connection info once per
> > > LeaderElectionDriver, i.e. LeaderElectionService. It's an
> > > implementation detail of the LeaderElectionService implementation to
> > > know what components it owns. Therefore, I suggested that we would
> > > have a unique ID per LeaderElectionService instance with a single
> > > connection_info that is used by all the components that are registered
> > > to that service. If we decide to have a separate LeaderElectionService
> > > for a specific component (e.g. the resource manager) in the future,
> > > this would end up having a separate ConfigMap in k8s or a separate
> > > ZNode in ZooKeeper.
> > >
> > > I added these details to the FLIP [2]. That part, indeed, was quite
> > > poorly described there initially.
> > >
> > > I don't understand how the leader election affects the LAST_STATE
> > > changes in the Kubernetes Operator, though. We use a separate
> > > ConfigMap for the checkpoint data [3]. Can you elaborate a little bit
> > > more on your concern?
> > >
> > > [1]
> > > https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> > > [2]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> > > [3]
> > > https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
> > >
> > > > # LeaderContender#initiateLeaderElection
> > > > I do not get your point why we need the *initiateLeaderElection* in
> > > > *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> > > > could be executed as soon as the registration happens.
> > >
> > > That's the change I'm not really happy about. I'm still not sure
> > > whether I found the best solution here. The problem is the way the
> > > components are initialized. The initial plan was to call
> > > LeaderElectionService.register(LeaderContender) from within the
> > > LeaderContender constructor, which would return the LeaderElection
> > > instance that would be used as the adapter for the contender to
> > > confirm leadership. Therefore, the LeaderContender has to own the
> > > LeaderElection instance to be able to participate in the leader
> > > election handshake (i.e. grant leadership -> confirm leadership). With
> > > that setup, we wouldn't be able to grant leadership during the
> > > LeaderElectionService.register(LeaderContender) call because that
> > > would require the LeaderContender to confirm the leadership. That is
> > > not possible because the LeaderElection wasn't created and set within
> > > the LeaderContender, yet. Therefore, we have to have some means to
> > > initialize the LeaderElection before triggering the granting of
> > > leadership from within the LeaderElectionService. ...and that's what
> > > this method is for.
> > >
> > > Best,
> > > Matthias
> > >
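The initialization problem Matthias describes in the last quoted paragraph boils down to the ordering sketched below. This is a purely illustrative, simplified example; SketchedContender and the other names are made up and not part of FLIP-285.

import java.util.UUID;

// Simplified stand-ins, not Flink's actual interfaces.
interface SketchedLeaderElection {
    void confirmLeadership(UUID leaderSessionId, String leaderAddress);
}

interface SketchedLeaderElectionService {
    SketchedLeaderElection register(SketchedContender contender);
}

final class SketchedContender {

    // Still null while register(...) below is running.
    private SketchedLeaderElection leaderElection;

    SketchedContender(SketchedLeaderElectionService service) {
        // If the service granted leadership synchronously inside register(...),
        // grantLeadership(...) would be called before this assignment completes ...
        this.leaderElection = service.register(this);
    }

    void grantLeadership(UUID leaderSessionId) {
        // ... and this line would hit a null leaderElection. Hence the need to
        // either defer granting until the handle has been set (the purpose of
        // initiateLeaderElection) or to register outside the constructor.
        leaderElection.confirmLeadership(leaderSessionId, "<rpc-address>");
    }
}

Registering outside the constructor, an explicit initiate step, or guarding register() and the grant callback with the same lock are the options touched on in this thread for closing that window.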