Thanks Yang for getting back to me. I checked the connection information that's stored in the HA backend once more. My previous proposal was based on a wrong assumption: the address we store is the RPC endpoint's address. That address should be unique per component, which means that we shouldn't change how the HA backend stores the connection information; we don't store redundant information there. As a consequence, I had to iterate on my proposal in FLIP-285 once more. We're now relying on contender IDs that are going to be used internally for storing the connection information in the HA backend (analogously to how it's done in the MultipleComponentLeaderElectionService implementation right now).
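To make that a bit more tangible, the registration could surface roughly like this (just a sketch; the names are illustrative and not meant to be the final FLIP-285 API):

// Rough sketch only - not the final FLIP-285 interface.
public interface LeaderElectionService {

    // The contender ID (e.g. "resource_manager", "dispatcher") is used internally
    // by the service as the key under which the contender's connection information
    // (its RPC endpoint address) is stored in the HA backend - analogous to the
    // componentIds of the current MultipleComponentLeaderElectionService.
    void register(String contenderId, LeaderContender contender);
}

Each contender ID would still map to exactly one connection_info entry in the backend, i.e. the stored data itself wouldn't change.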
Additionally, I came to the conclusion that it's actually not really necessary to add the LeaderElection interface. Working with the contender IDs to identify the LeaderContender in the LeaderElectionService might be good enough. But I still kept the LeaderElection interface as an (optional) extension of FLIP-285 as it might improve testability a bit.

I added some diagrams and descriptions to the FLIP, hoping that this helps answer your questions, Yang.
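Regarding your question below about LeaderElection#confirmLeadership: if the optional LeaderElection extension is kept, the contender would get such an instance from the registration and use it as its handle for the leadership handshake. Roughly (again only a sketch, not the final interfaces):

import java.util.UUID;

// Sketch of the optional LeaderElection extension - not the final interface.
public interface LeaderElection {

    // Called by the contender after leadership was granted (grant -> confirm
    // handshake). In this sketch the call goes back to the owning
    // LeaderElectionService, which forwards the confirmed leader information to
    // its LeaderElectionDriver - the contender would not call
    // LeaderElectionDriver#writeLeaderInformation directly.
    void confirmLeadership(UUID leaderSessionId, String leaderAddress);
}

I also put a stripped-down sketch of the initialization/ordering problem behind LeaderContender#initiateLeaderElection at the very bottom of this mail, below the quoted thread.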
Best,
Matthias

On Mon, Jan 16, 2023 at 8:41 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Thanks Matthias for updating the FLIP.
>
> Given that we do not touch the dedicated ConfigMap for the checkpoints, this FLIP will not affect the LAST_STATE recovery for the Flink Kubernetes Operator.
>
> # LeaderContender#initiateLeaderElection
> Could we simply execute *leaderElectionService.register(contender)* in the same place as the current *leaderElectionService.start(contender)*, not within the constructor? Then we could ensure that *onGrantLeadership* does not happen before the contender is ready.
>
> # LeaderElection#confirmLeadership
> Could you please share more about how *LeaderElection#confirmLeadership* works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*, or will the *LeaderElectionService* be the adapter?
>
> # LeaderRetrievalService
> This might be out of the scope of this FLIP :)
> Do we need to introduce a corresponding *MultipleComponentLeaderRetrievalService*, especially when we are using only one ZNode/ConfigMap for storing all the leader information? For Kubernetes HA, we already use a shared ConfigMap watcher for all the leader retrieval services. However, for ZK HA, each leader retrieval service still has a dedicated *TreeCache*.
>
> Best,
> Yang
>
> Matthias Pohl <matthias.p...@aiven.io.invalid> wrote on Thu, Jan 12, 2023, 22:07:
>
> > Thanks Yang Wang for sharing your view on this. Please find my responses below.
> >
> > > # HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> > > We have already changed the HA data format after introducing the multiple component leader election in FLINK-24038. For K8s HA, the number of ConfigMaps was reduced from 4 to 2. Since we only have one leader elector, the K8s APIServer load should also be reduced. Why do we still need to change the format again? This just prevents the LAST_STATE upgrade mode in Flink-Kubernetes-Operator when the Flink version changes, even though it is a simple job and the state is compatible.
> >
> > The intention of this remark is that we could reduce the number of redundant records (the ZooKeeperMultipleComponentLeaderElectionHaServices' JavaDoc [1] visualizes the redundancy quite well since each of these connection_info records would contain the very same information). We're saving the same connection_info for each of the componentIds (e.g. resource_manager, dispatcher, ...) right now. My rationale was that we only need to save the connection info once per LeaderElectionDriver, i.e. LeaderElectionService. It's an implementation detail of the LeaderElectionService implementation to know what components it owns. Therefore, I suggested that we would have a unique ID per LeaderElectionService instance with a single connection_info that is used by all the components that are registered to that service. If we decide to have a separate LeaderElectionService for a specific component (e.g. the resource manager) in the future, this would end up having a separate ConfigMap in k8s or a separate ZNode in ZooKeeper.
> >
> > I added these details to the FLIP [2]. That part, indeed, was quite poorly described there initially.
> >
> > I don't understand how the leader election affects the LAST_STATE changes in the Kubernetes Operator, though. We use a separate ConfigMap for the checkpoint data [3]. Can you elaborate a little bit more on your concern?
> >
> > [1] https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> > [3] https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
> >
> > > # LeaderContender#initiateLeaderElection
> > > I do not get your point why we need the *initiateLeaderElection* in *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership* could be executed as soon as the registration happens.
> >
> > That's the change I'm not really happy about. I'm still not sure whether I found the best solution here. The problem is the way the components are initialized. The initial plan was to call LeaderElectionService.register(LeaderContender) from within the LeaderContender constructor, which would return the LeaderElection instance that would be used as the adapter for the contender to confirm leadership. Therefore, the LeaderContender has to own the LeaderElection instance to be able to participate in the leader election handshake (i.e. grant leadership -> confirm leadership). With that setup, we wouldn't be able to grant leadership during the LeaderElectionService.register(LeaderContender) call because that would require the LeaderContender to confirm the leadership. That is not possible because the LeaderElection wasn't created and set within the LeaderContender yet. Therefore, we have to have some means to initialize the LeaderElection before triggering the granting of leadership from within the LeaderElectionService. ...and that's what this method is for.
> >
> > Best,
> > Matthias
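PS: Since the initialization problem is a bit hard to describe in prose, here is a stripped-down sketch of what the quoted explanation above is about. It uses the original "register(contender) returns the LeaderElection" idea from the quoted mail; the names are simplified and not the actual Flink classes:

import java.util.UUID;

// Sketch of the constructor/registration ordering problem - simplified, not actual Flink code.
class SomeContender /* plays the LeaderContender role */ {

    private final LeaderElection leaderElection;

    SomeContender(LeaderElectionService leaderElectionService) {
        // If the service granted leadership synchronously inside register(...),
        // onGrantLeadership would already run while this.leaderElection is still
        // null, so the contender couldn't confirm the grant.
        this.leaderElection = leaderElectionService.register(this);
    }

    void onGrantLeadership(UUID leaderSessionId) {
        // grant -> confirm handshake: requires this.leaderElection to be set.
        leaderElection.confirmLeadership(leaderSessionId, "<this component's RPC address>");
    }
}

So the actual grant has to be deferred until the contender is fully constructed; that is the gap the initiateLeaderElection step is meant to close.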