Re: [DISCUSS] Creating an external connector repository

Chesnay Schepler Fri, 15 Oct 2021 06:37:10 -0700

My opinion of splitting the Flink repositories hasn't changed; I'm still in favor of it.

While it would technically be possible to release individual connectors even if they are part of the Flink repo, it is quite a hassle to do so and error-prone due to the current branch structure.

A split would also force us to watch out much more for API stability.


I'm gonna assume that we will move out all connectors:

What I'm concerned about, and which we never really covered in past discussions about split repositories, are

a) ways to share infrastructure (e.g., CI/release utilities/codestyle)
b) testing
c) documentation integration

Particularly for b) we still lack any real public utilities.

Even fundamental things such as the MiniClusterResource are not annotated in any way.

I would argue that we need to sort this out before a split can happen.

We've seen with the flink-benchmarks repo and recent discussions how easily things can break.

Related to that, there is the question of how Flink is then supposed to ensure that things don't break. My impression is that we heavily rely on the connector tests to that end at the moment.

Similarly, what connector (version) would be used for examples (like the WordCount which reads from Kafka) or (e2e) tests that want to read something other than a file? You end up with a circular dependency, which is always troublesome.

As for the repo structure, I would think that a single one could work quite well (because having 10+ connector repositories is just a mess), but currently I wouldn't set it up as a single project.

I would rather have something like N + 1 projects (one for each connector + a shared testing project) which are released individually as required, without any snapshot dependencies in-between.

Then 1 branch for each major Flink version (again, no snapshot dependencies). Individual connectors can be released at any time against any of the latest bugfix releases, which due to lack of binaries (and Python releases) would be a breeze.
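
To illustrate the "no snapshot dependencies" rule, here is a minimal sketch of a check that a release script could run before publishing a connector. It only relies on the Maven convention that snapshot versions end in "-SNAPSHOT"; the function name and the "shared-testing" dependency are hypothetical and not part of any existing Flink tooling.

```python
def snapshot_dependencies(dependencies):
    """Return the sorted names of dependencies pinned to a Maven snapshot version.

    `dependencies` maps a dependency name to its version string.
    """
    return sorted(
        name
        for name, version in dependencies.items()
        if version.endswith("-SNAPSHOT")
    )


# Hypothetical dependency set of one connector project.
example = {
    "flink-core": "1.14.0",            # released bugfix version: fine
    "shared-testing": "1.0-SNAPSHOT",  # hypothetical name; would block a release
}

offending = snapshot_dependencies(example)
if offending:
    print("Snapshot dependencies found:", ", ".join(offending))
```

A connector would only be releasable when the returned list is empty.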

I don't like the idea of moving existing connectors out of the Apache organization. At the very least, not all of them. While some are certainly ill-maintained (e.g., Cassandra) where it would be neat if external projects could maintain them, others (like Kafka) are not and quite fundamental to actually using Flink.


On 15/10/2021 14:47, Arvid Heise wrote:

Dear community,

Today I would like to kickstart a series of discussions around creating an
external connector repository. The main idea is to decouple the release
cycle of Flink with the release cycles of the connectors. This is a common
approach in other big data analytics projects and seems to scale better
than the current approach. In particular, it will yield the following
changes.


    - Faster releases of connectors: New features can be added more quickly,
      bugs can be fixed immediately, and we can have faster security patches in
      case of direct or indirect (through dependencies) security flaws.

    - New features can be added to old Flink versions: If the connector API
      didn’t change, the same connector jar may be used with different Flink
      versions. Thus, new features can also immediately be used with older Flink
      versions. A compatibility matrix on each connector page will help users to
      find suitable connector versions for their Flink versions.

    - More activity and contributions around connectors: If we ease the
      contribution and development process around connectors, we will see faster
      development and also more connectors. Since that heavily depends on the
      chosen approach discussed below, more details will be shown there.

    - An overhaul of the connector page: In the future, all known connectors
      will be shown on the same page in a similar layout independent of where
      they reside. They could be hosted on external project pages (e.g., Iceberg
      and Hudi), on some company page, or may stay within the main Flink
      repository. Connectors may receive some sort of quality seal such that
      users can quickly assess the production-readiness, and we could also add
      which community/company promises which kind of support.

    - If we take (some) connectors out of Flink, Flink CI will be faster
      and Flink devs will experience fewer build instabilities (which mostly come
      from connectors). That would also speed up Flink development.


Now I’d first like to collect your viewpoints on the ideal state. Let’s
first recap which approaches we currently have:


    - We have half of the connectors in the main Flink repository. Relatively
      few of them have received updates in the past couple of months.

    - Another large chunk of connectors is in Apache Bahir, which recently
      saw its first release in 3 years.

    - There are a few other (Apache) projects that maintain a Flink connector,
      such as Apache Iceberg, Apache Hudi, and Pravega.

    - A few connectors are listed on company-related repositories, such as
      Apache Pulsar on StreamNative and CDC connectors on Ververica.


My personal observation is that having a repository per connector seems to
increase the activity on a connector as it’s easier to maintain. For
example, in Apache Bahir all connectors are built against the same Flink
version, which may not be desirable when certain APIs change; for example,
SinkFunction will eventually be deprecated and removed, but the new Sink
interface may gain more features.

Now, I'd like to outline different approaches. All approaches will allow
you to host your connector on any kind of personal, project, or company
repository. We still want to provide a default place where users can
contribute their connectors and hopefully grow a community around it. The
approaches are:


    1. Create a mono-repo under the Apache umbrella where all connectors will
       reside, for example, github.com/apache/flink-connectors. That repository
       needs to follow the Apache rules: No GitHub issues, no Dependabot or
       similar tools, and a strict manual release process. It would be under the
       Flink community, such that Flink committers can write to that repository
       but no-one else.

    2. Create a GitHub organization with small repositories, for example
       github.com/flink-connectors. Since it’s not under the Apache umbrella,
       we are free to use whatever process we deem best (up to a future
       discussion). Each repository can have a shared list of maintainers +
       connector-specific committers. We can provide more automation. We may even
       allow different licenses to incorporate things like a connector to Oracle
       that cannot be released under ASL.

    3. ??? <- please provide your additional approaches


In both cases, we will provide opinionated module/repository templates
based on a connector testing framework and guidelines. Depending on the
approach, we may need to enforce certain things.

I’d like to first focus on what the community would ideally seek and
minimize the discussions around legal issues, which we would discuss later.
For now, I’d also like to postpone the discussion of whether we move all or
only a subset of connectors from Flink to the new default place, as it seems
to be orthogonal to the fundamental discussion.

PS: If the external repository for connectors is successful, I’d also like
to move out other things like formats, filesystems, and metric reporters in
the far future. So I’m actually aiming for
github.com/(apache/)flink-packages. But again this discussion is orthogonal
to the basic one.

PPS: Depending on the chosen approach, there may be synergies with the
recently approved flink-extended organization.
