It depends on how you define "split": if you split by module (as we do currently), you end up with the same complexity we have right now, namely the caching of artifacts and a brittle definition of the splits.

But there are other ways to split builds, for example into unit and integration tests; we could also add end-to-end tests to that list. At that point we're basically talking about multiple parallel builds that are fully independent. Let's also remember that caching the build artifact is only useful when the compile times are large enough to warrant it; if we only go with 2 splits, the caching wouldn't even be required in the grand scheme of things. We added the caching to Travis because, at 5+ builds (with that number guaranteed to go up), the compilation time was a much larger factor.
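
To make the unit/integration variant concrete, here is a rough sketch of what the two splits could boil down to, assuming the standard surefire/failsafe conventions (not necessarily how our poms are wired up today; -DskipUTs is a made-up property the poms would have to define):

    # split 1: compile + unit tests (surefire)
    mvn clean test

    # split 2: integration tests (failsafe), with the unit tests skipped
    mvn clean verify -DskipUTs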

As for the current split setup we have (i.e., by module), it isn't just about faster feedback times; the splits can also be used to isolate components from each other. I know that quite a few people appreciate the kafka/python module being in its own split, for example.

On 11/12/2019 16:44, Robert Metzger wrote:
Some comments on Chesnay's message:
- Changing the number of splits will not reduce the complexity.
- One can also use the Flink build machines by opening a PR to the
"flink-ci/flink" repo, no need to open crappy PRs :)
- On the number of builds being run: We currently use 4 out of 10 machines
offered by Alibaba, and we are not yet hitting any limits. In addition to
that, another big cloud provider has reached out to us, offering build
capacity.

But generally, I agree that solely relying on the build infrastructure of
Flink is not a good option. The free Azure builds should provide a
reasonable experience.


On Wed, Dec 11, 2019 at 3:22 PM Chesnay Schepler <ches...@apache.org> wrote:

Note that for B it's not strictly necessary to maintain the current
number of splits; 2 might already be enough to bring contributor builds
to a more reasonable level.

I don't think that a contributor build taking 3.5h is a viable option; people will start disregarding their own instance and just open a PR without having run the tests, which will naturally mean that PR quality will drop. Committers will probably start working around this and push branches into the flink repo for running tests; we have seen that in the past and currently see it for e2e tests.

This will increase the number of builds being run on the Flink machines by quite a bit, obviously affecting throughput and latency.

On 11/12/2019 14:59, Arvid Heise wrote:
Hi Robert,

thank you very much for raising this issue and improving the build system.
For now, I'd like to stick to a lean solution (= option A).

While option B can greatly reduce build times, it also has a habit of clogging up the build machines. Just some arbitrary numbers, but it currently feels like B cuts latency in half while using 10 machines for 30 minutes, which decreases the overall throughput significantly (with those numbers, roughly 10 x 30 = 300 machine-minutes versus about 60 for a single sequential build). Thus, when many folks want to see their commits tested, resources quickly run out, and this in turn significantly increases latency.
I'd like to have more predictable build times and would rather sacrifice some latency for now.

It would be interesting to see whether we could rearrange the project execution order in Maven such that fast projects are executed first. E2E tests should be executed last, which they somewhat are already because of the project dependencies.
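
As a side note, the Maven reactor only has to respect the dependency order, so a fast subset of modules could in principle be run in a separate, earlier invocation. A rough sketch (the module selection is just an example, not a proposal):

    # run a fast subset of modules (plus their dependencies) first ...
    mvn verify -pl flink-core,flink-runtime -am
    # ... and leave the slow ones, e.g. the e2e tests, to a later invocation
    mvn verify -pl flink-end-to-end-tests -am
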
Of course, I'm very interested in improving the overall build experience by exploring alternatives to Maven.

Best,

Arvid

On Wed, Dec 11, 2019 at 2:32 PM Robert Metzger <rmetz...@apache.org>
wrote:
Hey devs,

I need your opinion on something: as part of our migration from Travis to Azure, I'm revisiting the build system of Flink. I currently see two different ways of proceeding, and I would like to know your opinion on the two options.

A) We build and test Flink in one "mvn clean verify" call on the CI system.
B) We migrate the two-staged build of one compile job and N test jobs to Azure.

Option A) is what we are currently running as part of testing the Azure-based system.

Pros/Cons for A)
+ For "apache/flink" pushes and pull requests, the big testing machines need 1:30 hours to complete (this might go up by a few minutes because the python tests and some auxiliary tests are not executed yet).
+ Our build will be easier to maintain and understand, because we rely on fewer scripts.
- Builds on Flink forks using the free Azure plan currently take 3:30 hours to complete.

Pros/Cons for B)
+ Builds on Flink forks using the free Azure plan take 1:20 hours.
+ Builds take 1:20 hours on the big testing machines.
- Maintenance and complexity of the build scripts.
- The build times are a lot less predictable, because they depend on the availability of workers. The free-plan builds are currently fast because the test stage has 10 jobs and Azure offers 10 parallel workers. We only have a total of 8 big machines, so there will always be some queueing; in practice, build times for the "apache/flink" repo will be less favorable because of the scheduling.
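
For anyone who hasn't looked at the Travis setup, option B boils down to something like the following (a simplification, not the actual scripts):

    # stage 1 ("compile"): build everything once, skip tests, cache the output
    mvn clean install -DskipTests
    tar czf build-cache.tgz <build output dirs>   # the actual caching mechanism differs

    # stage 2 ("test"): N parallel jobs, each restores the cache and runs
    # only its group of modules
    tar xzf build-cache.tgz
    mvn verify -pl <module group for this job>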


In my opinion, the question is mostly: are you okay with waiting 3.5 hours for a build to finish on your private CI, in exchange for a less complex build system?
Ideally, we'll be able to reduce these 3.5 hours by using a more modern build tool ("Gradle") in the future.

I'm happy to hear your thoughts!

Best,
Robert


