Note that for B it's not strictly necessary to maintain the current
number of splits; 2 might already be enough to bring contributor builds
to a more reasonable level.
I don't think that a contributor build taking 3.5h is a viable option;
people will start disregarding their own instance and just open a PR
without having run the tests, which will naturally mean that PR quality
will drop. Committers probably will start working around this and push
branches into the flink repo for running tests; we have seen that in the
past and see this currently for e2e tests.
This will increase the number of builds being run on the Flink machines
by quite a bit, obviously affecting throughput and latency.
On 11/12/2019 14:59, Arvid Heise wrote:
Hi Robert,
thank you very much for raising this issue and improving the build system.
For now, I'd like to stick to a lean solution (= option A).
While option B can greatly reduce build times, it also has a habit of
clogging up the build machines. Just some arbitrary numbers, but it
currently feels like B cuts down latency by half but also uses 10 machines
for 30 minutes, decreasing the overall throughput significantly. Thus, when
many folks want to see their commits tested, resources quickly run out and
this in turn significantly increases latency.
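To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. All figures are illustrative assumptions: 90 minutes on one machine for A (the 1:30 figure from Robert's mail), 10 machines for 30 minutes for B (the arbitrary numbers above, ignoring the compile stage), and a pool of 8 machines. The helper name `builds_per_hour` is made up for this example.

```python
def builds_per_hour(machines, machine_minutes_per_build):
    """Upper bound on builds per hour for a fully busy pool of machines."""
    return machines * 60 / machine_minutes_per_build


# Assumed cost of one build, in machine-minutes (not measured values).
option_a = 1 * 90    # one machine busy for 90 minutes
option_b = 10 * 30   # ten machines busy for 30 minutes each

pool = 8  # assumed number of big machines

print(builds_per_hour(pool, option_a))  # roughly 5.3 builds per hour
print(builds_per_hour(pool, option_b))  # 1.6 builds per hour
```

Under these assumptions a single B build finishes sooner, but it costs more than three times the machine-minutes, so the pool saturates much earlier when many commits arrive at once.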
I'd like to have some more predictable build times and sacrifice some
latency for now.
It would be interesting to see if we could rearrange the project execution
in Maven, such that fast projects are executed first. E2E tests should be
executed last, which they are somewhat, because of the project dependencies.
Of course, I'm very interested in improving the overall build experience by
exploring alternatives to Maven.
Best,
Arvid
On Wed, Dec 11, 2019 at 2:32 PM Robert Metzger <rmetz...@apache.org> wrote:
Hey devs,
I need your opinion on something: As part of our migration from Travis to
Azure, I'm revisiting the build system of Flink. I currently see two
different ways of proceeding, and I would like to know your opinion on the
two options.
A) We build and test Flink in one "mvn clean verify" call on the CI system.
B) We migrate the two-stage build of one compile and N test jobs to Azure.
Option A) is what we are currently running as part of testing the
Azure-based system.
Pro/Cons for A)
+ for "apache/flink" pushes and pull requests, the big testing machines
need 1:30 hours to complete (this might go up by a few minutes because the
Python tests and some auxiliary tests are not executed yet)
+ Our build will be easier to maintain and understand, because we rely on
fewer scripts
- builds on Flink forks using the free Azure plan currently take 3:30
hours to complete.
Pro/Cons for B)
+ builds on Flink forks using the free Azure plan take 1:20 hours
+ Builds take 1:20 hours on the big testing machines
- maintenance and complexity of the build scripts
- the build times are a lot less predictable, because they depend on the
availability of workers. For the free plan builds, they are currently fast,
because the test stage has 10 jobs, and Azure offers 10 parallel workers.
We currently only have a total of 8 big machines, so there will always be
some queueing. In practice, for the "apache/flink" repo, build times will
be less favorable, because of the scheduling.
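The queueing effect in the last point can be sketched with a small simulation. This is only an illustration under assumed numbers: 10 test jobs of 30 minutes each (the per-job duration is hypothetical), handed in order to whichever worker becomes free first. The function name `makespan` is made up for this example and does not model Azure's actual scheduler.

```python
import heapq


def makespan(job_minutes, workers):
    """Finish time when each job goes to the worker that frees up first."""
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for minutes in job_minutes:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + minutes
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


jobs = [30] * 10  # hypothetical: 10 test jobs, 30 minutes each

print(makespan(jobs, 10))  # 30.0, every job gets its own worker
print(makespan(jobs, 8))   # 60.0, two jobs wait for a second wave
```

With 8 workers the test stage of a single build already needs two waves, and that is before other builds compete for the same machines.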
In my opinion, the question is mostly: Are you okay with waiting 3.5 hours for a
build to finish on your private CI, in favor of a less complex build
system?
Ideally, we'll be able to reduce these 3.5 hours by using a more modern
build tool ("Gradle") in the future.
I'm happy to hear your thoughts!
Best,
Robert