It depends on how you define "split": if you split by module (as we do currently), you end up with the same complexity we have right now, namely the caching of artifacts and a brittle definition of the splits.

But there are other ways to split builds, for example into unit and integration tests; we could also add end-to-end tests to that list. At that point we're basically talking about multiple parallel builds that are fully independent. Let's also remember that caching the build artifact is only useful when the compile times are large enough to warrant it; if we only go with 2 splits, the caching wouldn't even be required in the grand scheme of things. We added the caching to Travis because, at 5+ builds (with that number guaranteed to go up), the compilation time was a much larger factor.
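
To make the unit/integration variant concrete, here is a rough sketch of what the two splits could boil down to, assuming the standard surefire/failsafe conventions (not necessarily how our poms are wired up today; -DskipUTs is a made-up property the poms would have to define):

    # split 1: compile + unit tests (surefire)
    mvn clean test

    # split 2: integration tests (failsafe), with the unit tests skipped
    mvn clean verify -DskipUTs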

As for the current split setup we have (i.e., by module), it isn't just about faster feedback times; the splits can also be used to isolate components from each other. I know that quite a few people appreciate the kafka/python module being in its own split, for example.

On 11/12/2019 16:44, Robert Metzger wrote:
Some comments on Chesnay's message:
- Changing the number of splits will not reduce the complexity.
- One can also use the Flink build machines by opening a PR to the
"flink-ci/flink" repo, no need to open crappy PRs :)
- On the number of builds being run: We currently use 4 out of 10 machines
offered by Alibaba, and we are not yet hitting any limits. In addition to
that, another big cloud provider has reached out to us, offering build
capacity.

But generally, I agree that solely relying on the build infrastructure of
Flink is not a good option. The free Azure builds should provide a
reasonable experience.


On Wed, Dec 11, 2019 at 3:22 PM Chesnay Schepler <ches...@apache.org> wrote:

Note that for B it's not strictly necessary to maintain the current
number of splits; 2 might already be enough to bring contributor builds
to a more reasonable level.

I don't think that a contributor build taking 3.5h is a viable option; people will start disregarding their own instance and just open a PR without having run the tests, which will naturally mean that PR quality will drop. Committers will probably start working around this and push branches into the flink repo for running tests; we have seen that in the past and currently see it for e2e tests.

This will increase the number of builds being run on the Flink machines by quite a bit, obviously affecting throughput and latency.

On 11/12/2019 14:59, Arvid Heise wrote:
Hi Robert,

thank you very much for raising this issue and improving the build system.
For now, I'd like to stick to a lean solution (= option A).

While option B can greatly reduce build times, it also has a habit of clogging up the build machines. Just some arbitrary numbers, but it currently feels like B cuts latency in half while using 10 machines for 30 minutes, which decreases the overall throughput significantly (with those numbers, roughly 10 x 30 = 300 machine-minutes versus about 60 for a single sequential build). Thus, when many folks want to see their commits tested, resources quickly run out, and this in turn significantly increases latency.
I'd like to have more predictable build times and would rather sacrifice some latency for now.

It would be interesting to see whether we could rearrange the project execution order in Maven such that fast projects are executed first. E2E tests should be executed last, which they somewhat are already because of the project dependencies.
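
As a side note, the Maven reactor only has to respect the dependency order, so a fast subset of modules could in principle be run in a separate, earlier invocation. A rough sketch (the module selection is just an example, not a proposal):

    # run a fast subset of modules (plus their dependencies) first ...
    mvn verify -pl flink-core,flink-runtime -am
    # ... and leave the slow ones, e.g. the e2e tests, to a later invocation
    mvn verify -pl flink-end-to-end-tests -am
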
Of course, I'm very interested in improving the overall build experience by exploring alternatives to Maven.

Best,

Arvid

On Wed, Dec 11, 2019 at 2:32 PM Robert Metzger <rmetz...@apache.org>
wrote:
Hey devs,

I need your opinion on something: as part of our migration from Travis to Azure, I'm revisiting the build system of Flink. I currently see two different ways of proceeding, and I would like to know your opinion on the two options.

A) We build and test Flink in one "mvn clean verify" call on the CI system.
B) We migrate the two-staged build of one compile job and N test jobs to Azure.

Option A) is what we are currently running as part of testing the Azure-based system.

Pros/Cons for A)
+ For "apache/flink" pushes and pull requests, the big testing machines need 1:30 hours to complete (this might go up by a few minutes because the python tests and some auxiliary tests are not executed yet).
+ Our build will be easier to maintain and understand, because we rely on fewer scripts.
- Builds on Flink forks using the free Azure plan currently take 3:30 hours to complete.

Pros/Cons for B)
+ Builds on Flink forks using the free Azure plan take 1:20 hours.
+ Builds take 1:20 hours on the big testing machines.
- Maintenance and complexity of the build scripts.
- The build times are a lot less predictable, because they depend on the availability of workers. The free-plan builds are currently fast because the test stage has 10 jobs and Azure offers 10 parallel workers. We only have a total of 8 big machines, so there will always be some queueing; in practice, build times for the "apache/flink" repo will be less favorable because of the scheduling.
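
For anyone who hasn't looked at the Travis setup, option B boils down to something like the following (a simplification, not the actual scripts):

    # stage 1 ("compile"): build everything once, skip tests, cache the output
    mvn clean install -DskipTests
    tar czf build-cache.tgz <build output dirs>   # the actual caching mechanism differs

    # stage 2 ("test"): N parallel jobs, each restores the cache and runs
    # only its group of modules
    tar xzf build-cache.tgz
    mvn verify -pl <module group for this job>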


In my opinion, the question is mostly: are you okay with waiting 3.5 hours for a build to finish on your private CI, in exchange for a less complex build system?
Ideally, we'll be able to reduce these 3.5 hours by using a more modern build tool ("Gradle") in the future.

I'm happy to hear your thoughts!

Best,
Robert


