Thanks Matthias for updating the FLIP. Given that we do not touch the dedicated ConfigMap for checkpoints, this FLIP will not affect LAST_STATE recovery in the Flink Kubernetes Operator.
# LeaderContender#initiateLeaderElection

Could we simply execute *leaderElectionService.register(contender)* in the
same place as the current *leaderElectionService.start(contender)*, rather
than within the constructor? Then we could ensure that *onGrantLeadership*
does not happen before the contender is ready.

# LeaderElection#confirmLeadership

Could you please share more about how *LeaderElection#confirmLeadership*
works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*,
or will the *LeaderElectionService* be the adapter?

# LeaderRetrievalService

This might be out of the scope of this FLIP :) Do we need to introduce a
corresponding *MultipleComponentLeaderRetrievalService*, especially when we
are using only one ZNode/ConfigMap for storing all the leader information?
For Kubernetes HA, we already use a shared ConfigMap watcher for all the
leader retrieval services. However, for ZK HA, each leader retrieval
service still has a dedicated *TreeCache*.

Best,
Yang
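[For concreteness, a minimal Java sketch of the register/grant/confirm flow
these questions are about. Only the names used in this thread (register,
confirmLeadership, onGrant/RevokeLeadership, writeLeaderInformation) come
from the FLIP discussion; every signature and the ExampleContender class
are assumptions made up for illustration.]

    import java.util.UUID;

    interface LeaderElectionService {
        // Registration returns the contender-specific handle ("adapter").
        LeaderElection register(LeaderContender contender);
    }

    interface LeaderElection {
        // Whether this writes the leader information directly through
        // LeaderElectionDriver#writeLeaderInformation or delegates back to
        // the LeaderElectionService is exactly the confirmLeadership
        // question above.
        void confirmLeadership(UUID leaderSessionId, String leaderAddress);
    }

    interface LeaderContender {
        void onGrantLeadership(UUID leaderSessionId);
        void onRevokeLeadership();
    }

    // Yang's suggestion: call register() where start(contender) is called
    // today, i.e. only after the contender is fully constructed, so that
    // onGrantLeadership can never fire against a half-initialized contender.
    final class ExampleContender implements LeaderContender {

        private LeaderElection leaderElection;

        void start(LeaderElectionService service) {
            this.leaderElection = service.register(this);
        }

        @Override
        public void onGrantLeadership(UUID leaderSessionId) {
            // Grant -> confirm handshake.
            leaderElection.confirmLeadership(leaderSessionId, "<leader-address>");
        }

        @Override
        public void onRevokeLeadership() {
            // Stop leader-only services here.
        }
    }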
Matthias Pohl <matthias.p...@aiven.io.invalid> wrote on Thu, Jan 12, 2023 at 22:07:

> Thanks Yang Wang for sharing your view on this. Please find my responses
> below.
>
> > # HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> >
> > We have already changed the HA data format after introducing the
> > multiple component leader election in FLINK-24038. For K8s HA, the
> > number of ConfigMaps was reduced from 4 to 2. Since we only have one
> > leader elector, the K8s APIServer load should also be reduced. Why do
> > we still need to change the format again? This just prevents the
> > LAST_STATE upgrade mode in Flink-Kubernetes-Operator when the Flink
> > version changes, even though it is a simple job and the state is
> > compatible.
>
> The intention of this remark is that we could reduce the number of
> redundant records (the JavaDoc of
> ZooKeeperMultipleComponentLeaderElectionHaServices [1] visualizes the
> redundancy quite well, since each of these connection_info records
> contains the very same information). Right now, we're saving the same
> connection_info for each of the componentIds (e.g. resource_manager,
> dispatcher, ...). My rationale was that we only need to save the
> connection info once per LeaderElectionDriver, i.e. per
> LeaderElectionService. It's an implementation detail of the
> LeaderElectionService implementation to know which components it owns.
> Therefore, I suggested having a unique ID per LeaderElectionService
> instance with a single connection_info that is used by all the components
> registered to that service. If we decide to have a separate
> LeaderElectionService for a specific component (e.g. the resource
> manager) in the future, it would end up with a separate ConfigMap in k8s
> or a separate ZNode in ZooKeeper.
>
> I added these details to the FLIP [2]. That part, indeed, was quite
> poorly described there initially.
>
> I don't understand how the leader election affects the LAST_STATE changes
> in the Kubernetes Operator, though. We use a separate ConfigMap for the
> checkpoint data [3]. Can you elaborate a little bit more on your concern?
>
> [1]
> https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> [3]
> https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
>
> > # LeaderContender#initiateLeaderElection
> >
> > I do not get your point why we need *initiateLeaderElection* in
> > *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> > could be executed as soon as the registration happens.
>
> That's the change I'm not really happy about. I'm still not sure whether
> I found the best solution here. The problem is the way the components are
> initialized. The initial plan was to call
> LeaderElectionService.register(LeaderContender) from within the
> LeaderContender constructor, which would return the LeaderElection
> instance used as the adapter for the contender to confirm its leadership.
> The LeaderContender therefore has to own the LeaderElection instance to
> be able to participate in the leader election handshake (i.e. grant
> leadership -> confirm leadership). With that setup, we wouldn't be able
> to grant leadership during the
> LeaderElectionService.register(LeaderContender) call, because that would
> require the LeaderContender to confirm the leadership, which is not
> possible while the LeaderElection hasn't been created and set within the
> LeaderContender yet. Therefore, we need some means to initialize the
> LeaderElection before the LeaderElectionService triggers the granting of
> leadership. ...and that's what this method is for.
>
> Best,
> Matthias
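[To make the initialization-order problem above concrete: a hypothetical
sketch of a contender that registers from within its constructor, as in the
initial plan. It reuses the interfaces from the earlier sketch; all
signatures remain assumptions.]

    import java.util.UUID;

    // Registers from within the constructor, as in the initial plan.
    final class ConstructorRegisteringContender implements LeaderContender {

        private final LeaderElection leaderElection;

        ConstructorRegisteringContender(LeaderElectionService service) {
            // If the service granted leadership synchronously inside
            // register(), onGrantLeadership would run before this field is
            // assigned, and the confirmLeadership call below would hit a
            // null leaderElection.
            this.leaderElection = service.register(this);
        }

        @Override
        public void onGrantLeadership(UUID leaderSessionId) {
            // Needs this.leaderElection, so the grant has to be deferred
            // until after construction. Per the explanation above, letting
            // the LeaderElection be set before the LeaderElectionService
            // starts granting is what LeaderContender#initiateLeaderElection
            // is for; Yang's question suggests moving register() out of the
            // constructor instead (as in the first sketch).
            leaderElection.confirmLeadership(leaderSessionId, "<leader-address>");
        }

        @Override
        public void onRevokeLeadership() {}
    }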