Thanks, Yang Wang, for sharing your view on this. Please find my
responses below.

# HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> We have already changed the HA data format after introducing the multiple
> component leader election in FLINK-24038. For K8s HA,
> the num of ConfigMaps reduced from 4 to 2. Since we only have one leader
> elector, the K8s APIServer load should also be reduced.
> Why do we still need to change the format again? This just prevents the
> LAST_STATE upgrade mode in Flink-Kubernetes-Operator
> when the Flink version changed, even though it is a simple job and state is
> compatible.
>

The intention of this remark is to reduce the number of redundant
records (the JavaDoc of ZooKeeperMultipleComponentLeaderElectionHaServices
[1] visualizes the redundancy quite well: each of these connection_info
records contains the very same information). Right now, we save the same
connection_info for each of the componentIds (e.g. resource_manager,
dispatcher, ...). My rationale was that we only need to save the
connection info once per LeaderElectionDriver, i.e. once per
LeaderElectionService. Which components it owns is an implementation
detail of the LeaderElectionService implementation. Therefore, I
suggested having a unique ID per LeaderElectionService instance with a
single connection_info that is shared by all the components registered
to that service. If we decide to introduce a separate
LeaderElectionService for a specific component (e.g. the resource
manager) in the future, that service would end up with its own ConfigMap
in k8s or its own ZNode in ZooKeeper.
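
To make this a bit more concrete, here's a rough sketch of the two
layouts (the paths are simplified; see the JavaDoc [1] for the exact
current structure and the FLIP [2] for the proposal; the
<leader-election-service-id> placeholder is purely illustrative). Today
we store one connection_info per componentId:

    /flink/<cluster-id>/leader/resource_manager/connection_info
    /flink/<cluster-id>/leader/dispatcher/connection_info
    /flink/<cluster-id>/leader/rest_server/connection_info

With the proposed schema there would be a single record per
LeaderElectionService instance that all components registered to that
service share:

    /flink/<cluster-id>/leader/<leader-election-service-id>/connection_info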

I added these details to the FLIP [2]. That part, indeed, was quite
poorly described there initially.

I don't understand how the leader election change affects the LAST_STATE
upgrade mode in the Kubernetes Operator, though. We use a separate
ConfigMap for the checkpoint data [3]. Could you elaborate a bit more on
your concern?

[1]
https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
[3]
https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163


> # LeaderContender#initiateLeaderElection
> I do not get your point why we need the *initiateLeaderElection* in
> *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> could be executed as soon as the registration.
>

That's the change I'm not really happy about, and I'm still not sure
whether I found the best solution here. The problem is the way the
components are initialized. The initial plan was to call
LeaderElectionService.register(LeaderContender) from within the
LeaderContender's constructor, which would return the LeaderElection
instance that serves as the adapter through which the contender confirms
its leadership. The LeaderContender therefore has to own the
LeaderElection instance to be able to participate in the leader election
handshake (i.e. grant leadership -> confirm leadership). With that
setup, we couldn't grant leadership during the
LeaderElectionService.register(LeaderContender) call, because that would
require the LeaderContender to confirm the leadership, which is not
possible as long as the LeaderElection hasn't been created and set
within the LeaderContender yet. Therefore, we need some means to
initialize the LeaderElection before the LeaderElectionService triggers
granting the leadership. ...and that's what this method is for.
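
To illustrate the ordering problem, here's a minimal, self-contained
Java sketch with heavily simplified stand-ins for the interfaces. The
names loosely follow the FLIP, but the exact signatures and the call
sequence around initiateLeaderElection are assumptions for illustration
only, not the proposed API:

    import java.util.UUID;

    // Heavily simplified stand-ins for the interfaces discussed in the FLIP.
    interface LeaderElection {
        void confirmLeadership(UUID leaderSessionId, String leaderAddress);
    }

    interface LeaderElectionService {
        LeaderElection register(LeaderContender contender);
    }

    interface LeaderContender {
        void grantLeadership(UUID leaderSessionId);
    }

    class SomeLeaderContender implements LeaderContender {

        private final LeaderElection leaderElection;

        SomeLeaderContender(LeaderElectionService leaderElectionService) {
            // If the service granted leadership synchronously from within
            // register(...), grantLeadership(...) would run before the
            // assignment below has happened, i.e. this.leaderElection would
            // still be null and the contender couldn't confirm leadership.
            this.leaderElection = leaderElectionService.register(this);

            // Only after this point may the service start granting
            // leadership to this contender; that is the gap
            // LeaderContender#initiateLeaderElection is meant to close
            // (the exact call sequence is simplified away here).
        }

        @Override
        public void grantLeadership(UUID leaderSessionId) {
            // The handshake: grant leadership -> confirm leadership.
            leaderElection.confirmLeadership(leaderSessionId, "<leader-address>");
        }
    }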

Best,
Matthias
