Hi Matthias,

Thanks for the proposal! I am in favor of cleaning up this interface, as
it seems a bit cumbersome now, especially since the implementation of
per-component leader election has already been removed from our current
code path.
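
To make the two interface shapes under discussion concrete, here is a rough
sketch. All type and method names below are made up for illustration and are
not Flink's actual HighAvailabilityServices API. It shows the per-component
shape (four factory methods), the per-process shape (one factory method), and
an adapter that keeps the old shape while handing out one shared instance.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical stand-in for a leader election service (not Flink's interface).
interface ElectionService {
    String contenderGroup();
}

// Per-component shape: one factory method per component (illustrative names).
interface PerComponentHaServices {
    ElectionService resourceManagerElection();
    ElectionService dispatcherElection();
    ElectionService restEndpointElection();
    ElectionService jobMasterElection(String jobId);
}

// Per-process shape: a single factory method used by all components.
interface PerProcessHaServices {
    ElectionService leaderElection();
}

final class LeaderElectionSketch {

    // Keeps the four-method shape but returns the same shared instance from
    // every factory method, so all components follow one election.
    static PerComponentHaServices adapt(PerProcessHaServices perProcess) {
        final ElectionService shared = perProcess.leaderElection();
        return new PerComponentHaServices() {
            @Override public ElectionService resourceManagerElection() { return shared; }
            @Override public ElectionService dispatcherElection() { return shared; }
            @Override public ElectionService restEndpointElection() { return shared; }
            @Override public ElectionService jobMasterElection(String jobId) { return shared; }
        };
    }

    // Counts how many distinct election instances the components would use,
    // comparing by object identity.
    static int distinctInstances(PerComponentHaServices ha) {
        Set<ElectionService> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.add(ha.resourceManagerElection());
        seen.add(ha.dispatcherElection());
        seen.add(ha.restEndpointElection());
        seen.add(ha.jobMasterElection("job-1"));
        return seen.size();
    }

    public static void main(String[] args) {
        final ElectionService single = () -> "jm-process";
        PerProcessHaServices perProcess = () -> single;
        System.out.println(distinctInstances(adapt(perProcess)));
    }
}
```

With the single-method shape, a per-component election cannot come back by
accident. With the adapter, that guarantee depends on every implementation
returning the same instance.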

To be honest, I don't like the per-component approach. I am often asked
why Flink was designed this way. Of course, I admit that it makes our HA
service more flexible. But personally, I think the per-process solution is
better, at least from the perspective of reducing potential problems
like FLINK-24038, and it can definitely reduce the complexity of the
JobManager.

Regarding "We lose some flexibility in terms of per-component
LeaderElection", I am curious whether there really are so many extreme
requirements that can only be met with the per-component pattern. If there
are, are these requirements really reasonable? Users who rely on them may
inadvertently recreate problems similar to FLINK-24038.

Best regards,

Weijie


On Fri, Dec 9, 2022 at 5:47 PM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:

> Hi Dong,
> see my answers below.
>
> > Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services?
>
>
> Yes, the proposed change might affect projects that need to have their own
> HA implementation for whatever reason (interface change) or if a project
> accesses the HA backend to retrieve metadata from the ZK node/k8s ConfigMap
> (change about how the data is stored in the HA backend). The latter one was
> actually already the case with the change introduced by FLINK-24038 [1].
>
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still need
> > to hack into Flink's code to customize high availability services?
>
>
> I am aware of projects that use customized HA. But based on our experience
> in FLINK-24038 [1] no one complained. So, making people aware through the
> mailing list might be good enough.
>
> > And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we can
> > gauge the associated downside of losing the flexibility?
>
>
> Just to recap: The current interface allows having per-component
> LeaderElection (e.g. the ResourceManager leader can run on a different
> JobManager than the Dispatcher). This implementation was replaced by
> FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation does
> LeaderElection per process (e.g. ResourceManager and Dispatcher always run
> on the same JobManager). The changed interface would require us to touch
> the interface again if (for whatever reason) we want to reintroduce
> per-component leader election in some form.
> The interface change is, strictly speaking, not necessary to provide the
> new functionality. But I like the idea of certain requirements (currently,
> we need per-process leader election to fix what was reported in FLINK-24038
> [1]) being reflected in the interface. This makes sure that we don't
> introduce a per-component leader election again accidentally in the future
> because we thought it's a good idea but forgot about FLINK-24038.
>
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-24038
> [2] https://issues.apache.org/jira/browse/FLINK-25806
>
> On Fri, Dec 9, 2022 at 2:09 AM Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi Matthias,
> >
> > Thanks for the proposal! Overall I am in favor of making this interface
> > change to make Flink's codebase more maintainable.
> >
> > Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services? If yes, it seems OK to break those projects since we don't have
> > any backward compatibility guarantee for those projects.
> >
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still need
> > to hack into Flink's code to customize high availability services?
> >
> > And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we can
> > gauge the associated downside of losing the flexibility?
> >
> > Thanks!
> > Dong
> >
> >
> >
> > On Wed, Dec 7, 2022 at 4:28 PM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
> >
> > > Hi everyone,
> > >
> > > The Flink community changed how leader election works in Flink 1.15
> > > with FLINK-24038 [1]. Instead of a per-component leader election, all
> > > components (i.e. ResourceManager, Dispatcher, REST server, JobMaster) use a
> > > single (per-JM-process) leader election instance. It was meant to fix some
> > > issues with deregistering Flink applications in multi-JM setups [1] and
> > > reduce load on the HA backend. Users were able to opt out and switch back
> > > to the old implementation [2].
> > >
> > > The new approach was kind of complicated to implement while still
> > > maintaining support for the old implementation through the existing
> > > interfaces. With FLINK-25806 [3], the old implementation was removed in
> > > Flink 1.16. This enables us to clean things up in the
> > > HighAvailabilityServices.
> > >
> > > The proposed change would mean touching the HighAvailabilityServices
> > > interface. Currently, the interface provides factory methods for
> > > LeaderElectionServices of the aforementioned components. All of these
> > > LeaderElectionServices are internally based on the same LeaderElection
> > > instance handled in DefaultMultipleComponentLeaderElectionService.
> > > Therefore, we can replace all these factory methods by a single one which
> > > returns a LeaderElectionService instance that’s going to be used by all
> > > components. Of course, we could also stick to the old
> > > HighAvailabilityServices and return the same LeaderElectionService instance
> > > through each of the four factory methods (similar to what’s done now with
> > > the MultipleComponentLeaderElectionService).
> > >
> > > A similar question appears for the corresponding LeaderRetrievalService: We
> > > could create a single listener instead of having individual per-component
> > > listeners to reflect the current requirement of having a per-JM-process
> > > leader election and align it with the LeaderElectionService approach (if we
> > > decide on modifying the HA interface).
> > >
> > > I didn’t come up with a dedicated FLIP: HighAvailabilityServices are not
> > > considered a public interface. Still, I am aware it might affect users
> > > (e.g. if they implemented their own HA services or if the project monitors
> > > HA information in the HA backend outside of Flink). That’s why I wanted to
> > > start a discussion here. I’m happy to create a FLIP, if someone thinks it’s
> > > worth it. The work is going to be covered by FLINK-26522 [4].
> > >
> > > Pros (for changing the interface methods):
> > >
> > >    - It reflects the requirements stated in FLINK-24038 [1] about having a
> > >      per-JM-process LeaderElection
> > >    - It helps reduce the complexity of the JobManager
> > >
> > > Cons:
> > >
> > >    - We lose some flexibility in terms of per-component LeaderElection
> > >    - Interface change might affect other projects that customize HA services
> > >
> > >
> > > I’m in favor of reducing the number of factory methods in
> > > HighAvailabilityServices considering that it’s not a public interface. I’m
> > > looking forward to your opinions.
> > >
> > > Matthias
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-24038
> > > [2] https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/#high-availability-use-old-ha-services
> > > [3] https://issues.apache.org/jira/browse/FLINK-25806
> > > [4] https://issues.apache.org/jira/browse/FLINK-26522
> > >
> > >
> > > --
> > >
> > > Matthias Pohl
> > >
> > > Software Engineer, Aiven
> > >
> > > matthias.p...@aiven.io <i...@aiven.io>
> > >
> > > Aiven Deutschland GmbH
> > >
> > > Immanuelkirchstraße 26, 10405 Berlin
> > >
> > > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > >
> > > Amtsgericht Charlottenburg, HRB 209739 B
> > >
> >
>
