Thanks Matthias for the detailed explanation. For the HA backend data structure, you are right: even though the different components are running in the same JVM, they have completely different connection infos. But it is not urgent to use a single ZNode to store multiple connection entries for now.
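Just to be explicit about what I mean by "a single ZNode with multiple connection entries", a rough sketch is below. The class and method names are made up for illustration only and are not part of the FLIP.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical consolidated record: one ZNode/ConfigMap entry per
// LeaderElectionService, holding one (leader session id, rpc address) pair
// per contender ID instead of one entry per component.
final class ConsolidatedLeaderInformation {

    // e.g. "resource_manager" -> (leader session id, rpc address)
    private final Map<String, Map.Entry<UUID, String>> entriesPerContenderId = new HashMap<>();

    void put(String contenderId, UUID leaderSessionId, String rpcAddress) {
        entriesPerContenderId.put(contenderId, Map.entry(leaderSessionId, rpcAddress));
    }

    Optional<String> leaderAddressOf(String contenderId) {
        return Optional.ofNullable(entriesPerContenderId.get(contenderId))
                .map(Map.Entry::getValue);
    }
}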
I lean towards not introducing the *LeaderElection* interface since it does not bring many benefits. BTW, if *LeaderElectionService#register* (which returns the *LeaderElection*) and *LeaderElectionService#onGrantLeadership* are guarded by the same lock, then we can ensure that the leaderElection in the *LeaderContender* is always non-null when it tries to confirm the leadership. And then we would not need *LeaderContender#initiateLeaderElection*. Right?
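To make the locking idea concrete, here is a minimal, purely illustrative sketch. The types are simplified stand-ins invented for this mail (not the FLIP's or Flink's actual API); it only shows register() and the grant path sharing one lock.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified stand-ins for the interfaces discussed in this thread.
interface SketchContender {
    void grantLeadership(UUID leaderSessionId);
}

interface SketchLeaderElection {
    void confirmLeadership(UUID leaderSessionId, String leaderAddress);
}

final class SketchLeaderElectionService {

    private final Object lock = new Object();
    private final List<SketchContender> contenders = new ArrayList<>();

    // Registration and leadership grants are serialized by the same lock, so a
    // grant can never be delivered while a register(...) call is still running.
    SketchLeaderElection register(SketchContender contender) {
        synchronized (lock) {
            contenders.add(contender);
            return (sessionId, address) -> {
                // In a real service this would forward the confirmation to the
                // leader election driver / write the leader information.
            };
        }
    }

    // Invoked by the leader election driver once this service instance has the
    // leadership.
    void onGrantLeadership(UUID leaderSessionId) {
        synchronized (lock) {
            for (SketchContender contender : contenders) {
                contender.grantLeadership(leaderSessionId);
            }
        }
    }
}

The sketch only covers the serialization of registration and grants; revocation, error handling, and the driver wiring are left out.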
Best,
Yang

On Tue, Jan 17, 2023 at 8:31 PM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:

> Thanks Yang for getting back to me.
>
> I checked the connection information that's stored in the HA backend once
> more. My previous proposal was based on a wrong assumption: the address we
> store is the RPC endpoint's address. That address should be unique per
> component, which means that we shouldn't change how the HA backend stores
> the connection information. We don't store redundant information here. As
> a consequence, I had to reiterate over my proposal in FLIP-285. We're now
> relying on contender IDs that are going to be used internally for storing
> the connection information in the HA backend (analogously to how it's done
> in the MultipleComponentLeaderElectionService implementation right now).
>
> Additionally, I came to the conclusion that it's actually not really
> necessary to add the LeaderElection interface. Working with the contender
> IDs to identify the LeaderContender in the LeaderElectionService might be
> good enough. But I still kept the LeaderElection interface as an
> (optional) extension of FLIP-285 as it might improve testability a bit. I
> added some diagrams and descriptions to the FLIP hoping that this helps
> answer your questions, Yang.
>
> Best,
> Matthias
>
> On Mon, Jan 16, 2023 at 8:41 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
> > Thanks Matthias for updating the FLIP.
> >
> > Given that we do not touch the dedicated ConfigMap for the checkpoint,
> > this FLIP will not affect the LAST_STATE recovery for the Flink
> > Kubernetes Operator.
> >
> > # LeaderContender#initiateLeaderElection
> > Could we simply execute *leaderElectionService.register(contender)* in
> > the same place as the current *leaderElectionService.start(contender)*,
> > not within the constructor? Then we could ensure that
> > *onGrantLeadership* does not happen before the contender is ready.
> >
> > # LeaderElection#confirmLeadership
> > Could you please share more about how *LeaderElection#confirmLeadership*
> > works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*,
> > or will the *LeaderElectionService* be the adapter?
> >
> > # LeaderRetrievalService
> > This might be out of the scope of this FLIP :)
> > Do we need to introduce a corresponding
> > *MultipleComponentLeaderRetrievalService*, especially when we are using
> > only one ZNode/ConfigMap for storing all the leader information?
> > For Kubernetes HA, we already use a shared ConfigMap watcher for all the
> > leader retrieval services. However, for ZK HA, each leader retrieval
> > service still has a dedicated *TreeCache*.
> >
> > Best,
> > Yang
> >
> > On Thu, Jan 12, 2023 at 10:07 PM Matthias Pohl
> > <matthias.p...@aiven.io.invalid> wrote:
> >
> > > Thanks Yang Wang for sharing your view on this. Please find my
> > > responses below.
> > >
> > > > # HA data format in the HA backend (e.g. ZK, K8s ConfigMap)
> > > > We have already changed the HA data format after introducing the
> > > > multiple component leader election in FLINK-24038. For K8s HA, the
> > > > number of ConfigMaps was reduced from 4 to 2. Since we only have one
> > > > leader elector, the K8s APIServer load should also be reduced.
> > > > Why do we still need to change the format again? This just prevents
> > > > the LAST_STATE upgrade mode in Flink-Kubernetes-Operator when the
> > > > Flink version changes, even though it is a simple job and the state
> > > > is compatible.
> > >
> > > The intention of this remark is that we could reduce the number of
> > > redundant records (the
> > > ZooKeeperMultipleComponentLeaderElectionHaServices' JavaDoc [1]
> > > visualizes the redundancy quite well since each of these
> > > connection_info records would contain the very same information).
> > > We're saving the same connection_info for each of the componentIds
> > > (e.g. resource_manager, dispatcher, ...) right now. My rationale was
> > > that we only need to save the connection info once per
> > > LeaderElectionDriver, i.e. LeaderElectionService. It's an
> > > implementation detail of the LeaderElectionService implementation to
> > > know what components it owns. Therefore, I suggested that we would
> > > have a unique ID per LeaderElectionService instance with a single
> > > connection_info that is used by all the components that are registered
> > > to that service. If we decide to have a separate LeaderElectionService
> > > for a specific component (e.g. the resource manager) in the future,
> > > this would end up having a separate ConfigMap in k8s or a separate
> > > ZNode in ZooKeeper.
> > >
> > > I added these details to the FLIP [2]. That part, indeed, was quite
> > > poorly described there initially.
> > >
> > > I don't understand how the leader election affects the LAST_STATE
> > > changes in the Kubernetes Operator, though. We use a separate
> > > ConfigMap for the checkpoint data [3]. Can you elaborate a little bit
> > > more on your concern?
> > >
> > > [1]
> > > https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> > > [2]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> > > [3]
> > > https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
> > >
> > > > # LeaderContender#initiateLeaderElection
> > > > I do not get your point why we need the *initiateLeaderElection* in
> > > > *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> > > > could be executed as soon as the registration happens.
> > >
> > > That's the change I'm not really happy about. I'm still not sure
> > > whether I found the best solution here. The problem is the way the
> > > components are initialized. The initial plan was to call
> > > LeaderElectionService.register(LeaderContender) from within the
> > > LeaderContender constructor, which would return the LeaderElection
> > > instance that would be used as the adapter for the contender to
> > > confirm leadership. Therefore, the LeaderContender has to own the
> > > LeaderElection instance to be able to participate in the leader
> > > election handshake (i.e. grant leadership -> confirm leadership). With
> > > that setup, we wouldn't be able to grant leadership during the
> > > LeaderElectionService.register(LeaderContender) call because that
> > > would require the LeaderContender to confirm the leadership. That is
> > > not possible because the LeaderElection wasn't created and set within
> > > the LeaderContender, yet. Therefore, we have to have some means to
> > > initialize the LeaderElection before triggering the granting of
> > > leadership from within the LeaderElectionService. ...and that's what
> > > this method is for.
> > >
> > > Best,
> > > Matthias
> > >
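The initialization problem Matthias describes in the last quoted paragraph boils down to the ordering sketched below. This is a purely illustrative, simplified example; SketchedContender and the other names are made up and not part of FLIP-285.

import java.util.UUID;

// Simplified stand-ins, not Flink's actual interfaces.
interface SketchedLeaderElection {
    void confirmLeadership(UUID leaderSessionId, String leaderAddress);
}

interface SketchedLeaderElectionService {
    SketchedLeaderElection register(SketchedContender contender);
}

final class SketchedContender {

    // Still null while register(...) below is running.
    private SketchedLeaderElection leaderElection;

    SketchedContender(SketchedLeaderElectionService service) {
        // If the service granted leadership synchronously inside register(...),
        // grantLeadership(...) would be called before this assignment completes ...
        this.leaderElection = service.register(this);
    }

    void grantLeadership(UUID leaderSessionId) {
        // ... and this line would hit a null leaderElection. Hence the need to
        // either defer granting until the handle has been set (the purpose of
        // initiateLeaderElection) or to register outside the constructor.
        leaderElection.confirmLeadership(leaderSessionId, "<rpc-address>");
    }
}

Registering outside the constructor, an explicit initiate step, or guarding register() and the grant callback with the same lock are the options touched on in this thread for closing that window.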