On 7 Aug 2019, at 13:14, Chesnay Schepler <ches...@apache.org> wrote:
Hello everyone,
The Flink project sees an ever-increasing amount of dev activity, both in terms
of reworked and new features.
This is of course an excellent situation to be in, but we are getting to a
point where the associated downsides are becoming increasingly troublesome.
The ever-increasing build times, in addition to unstable tests, significantly
slow down the development process.
Additionally, pull requests for smaller features frequently slip through the
cracks as they are being buried under a mountain of other pull requests.
As a result I'd like to start a discussion on splitting the Flink repository.
In this mail I will outline the core idea, and what problems I currently
envision.
I'd specifically like to encourage those who were part of similar initiatives
in other projects to share their experiences and ideas.
General Idea
For starters, the idea is to create a new repository for "flink-connectors".
For the remainder of this mail, the current Flink repository is referred to as
"flink-main".
There are also other candidates that we could discuss in the future, like
flink-libraries (the next top-priority repo to ease flink-ml development),
metric reporters, filesystems and flink-formats.
Moving out flink-connectors provides the most benefits, as we straight away
save at least an hour of testing time, and not being included in the binary
distribution simplifies a few things.
Problems to solve
To make this a reality there are a number of questions we have to discuss; some
in the short-term, others in the long-term.
1) Git history
We have to decide whether we want to rewrite the history of sub
repositories to only contain diffs/commits related to this part of
Flink, or whether we just fork from some commit in flink-main and
add a commit to the connector repo that "transforms" it from
flink-main to flink-connectors (i.e., remove everything unrelated to
connectors + update module structure etc.).
The latter option would have the advantage that our commit
bookkeeping in JIRA would still be correct, but it would create a
significant divide between the current and past state of the repository.
2) Maven
We should look into whether there's a way to share dependency/plugin
configurations and similar, so we don't have to keep them in sync
manually across multiple repositories.
A new parent Flink pom that all repositories define as their parent
could work; this would imply splicing out part of the current root
pom.xml.
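
To make the shared-parent idea a bit more concrete, here is a rough sketch of
what the root pom.xml of a split-out repository could look like. This is only
an illustration: the artifactId "flink-shared-parent" and both version numbers
are hypothetical placeholders, not existing artifacts or agreed-upon names.

```xml
<!-- Sketch of the root pom.xml of a split-out repository such as flink-connectors. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache.flink</groupId>
    <!-- Hypothetical shared parent holding the common dependency/plugin management. -->
    <artifactId>flink-shared-parent</artifactId>
    <!-- Placeholder version. -->
    <version>1.0</version>
    <!-- Empty relativePath: resolve the parent from a Maven repository,
         since it no longer lives in the same source tree. -->
    <relativePath/>
  </parent>

  <!-- groupId is inherited from the parent. -->
  <artifactId>flink-connectors</artifactId>
  <!-- Placeholder version. -->
  <version>1.10-SNAPSHOT</version>
  <packaging>pom</packaging>
</project>
```

With such a setup the shared parent would itself have to be released (or
deployed as a SNAPSHOT) before the other repositories can build against it,
which ties into the release question below.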
3) Documentation
Splitting the repository realistically also implies splitting the
documentation source files (at the beginning we can get by with
keeping them in flink-main).
We could just move the relevant files to the respective repository
(while maintaining the directory structure), and merge them when
building the docs.
We also have to look at how we can handle java-/scaladocs; e.g.
whether it is possible to aggregate them across projects.
4) CI (end-to-end tests)
The very basic question we have to answer is whether we want E2E
tests in the sub repositories. If so, we need to find a way to share
e2e-tooling.
5) Releases
We have to discuss what our release process will look like. This may
also have repercussions on how repositories may depend on each other
(SNAPSHOT vs LATEST). Note that this should be discussed for each
repo separately.
The current options I see are the following:
a) Single release
Release all repositories at once as a single product.
The source release would be a collection of repositories, like
flink/
|--flink-main/
   |--flink-core/
   |--flink-runtime/
   ...
|--flink-connectors/
   ...
|--flink-.../
   ...
This option requires a SNAPSHOT dependency between Flink
repositories, but it is pretty much how things work at the moment.
b) Synced releases
Similar to a), except that each repository gets its own source
release that may be released independently of other repositories.
For a given release cycle each repo would produce exactly one
release.
This option requires a SNAPSHOT dependency between Flink
repositories. Once any repository has created an RC or
finished its release, release-branches in other repos can
switch to that version.
This approach is a tad more flexible than a), but requires more
coordination between the repos.
c) Separate releases
Just like we handle flink-shaded; entirely separate release
cycles; some repositories may have more releases in a given time
period than others.
This option implies a LATEST dependency between Flink repositories.
Note that hybrid approaches would also make sense, like doing b) for
major versions and c) for bugfix releases.
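
As a rough illustration of what the two dependency styles could mean in a
connector pom.xml, consider the fragment below. It assumes that "LATEST" here
means "the most recent released version" of flink-main, written out
explicitly; the version numbers are placeholders only.

```xml
<!-- Alternative 1, options a) and b): depend on the in-development build of flink-main. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-core</artifactId>
  <!-- Placeholder SNAPSHOT version. -->
  <version>1.10-SNAPSHOT</version>
</dependency>

<!-- Alternative 2, option c): depend on an already released version of flink-main
     (assumed reading of "LATEST"; placeholder version). -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-core</artifactId>
  <version>1.8.1</version>
</dependency>
```

Only one of the two alternatives would appear in a given pom.xml. The first
means a change in flink-main can break the other repositories at any time; the
second decouples the builds, but new flink-main features only become usable
after a release.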
For something like flink-libraries this question may also have
repercussions on how/whether they are bundled in the distribution;
options a)/b) would maintain the status quo, c) and hybrid
approaches will likely necessitate the exclusion from the distribution.