e2e tests on Travis add another 4-5 hours, but we never optimized these
to make use of the cached Flink artifact.
On 04/09/2019 13:26, Till Rohrmann wrote:
How long do we need to run all e2e tests? They are not included in the 3.5 hours, I assume.
Cheers,
Till
On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger <rmetz...@apache.org> wrote:
Yes, we can ensure the same (or better) experience for contributors.
On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).
Azure Pipelines offers open source projects 10 concurrent builds with a 6-hour timeout per build. Flink needs 3.5 hours on that infra (not parallelized at all, no caching). These free machines are very similar to those of Travis, so I expect no build time regressions if we set it up similarly.
On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <ches...@apache.org>
wrote:
Will using more powerful machines for the project make it more difficult to ensure that contributor builds still run in a reasonable time? As an example of this happening on Travis, contributors currently cannot run all e2e tests since they time out, but on apache we have a larger timeout.
On 03/09/2019 18:57, Robert Metzger wrote:
Hi all,
I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently working on making all modules compile and test with Gradle. We've also identified some problematic areas (shading being the most obvious one) which we will analyse as part of the PoC. The goal is to see how much Gradle helps to parallelise our build and to avoid duplicate work (incremental builds).
- I am working on setting up a Flink testing infrastructure based on Azure Pipelines, using more powerful hardware. Alibaba kindly provided me with two 32-core machines (temporarily), and another company reached out to me privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines with our apache/flink GitHub repository as a build infrastructure that exists next to Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is equally or even more reliable than Travis, and I want to see what the required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of nice options for us to improve the build experience (statistics about tests (flaky tests etc.), nice Docker support, plenty of free build resources for open source projects, ...).
Best,
Robert
On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <rmetz...@apache.org>
wrote:
Hi all,
I have summarized all arguments mentioned so far + some additional research into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
I'm happy to hear further comments on my summary! I'm pretty sure we can find more pros and cons for the different options.
My opinion after looking at the options:
- Flink relies on an outdated build tool (Maven), while a good alternative is well-established (Gradle) and will likely provide a much better CI and local build experience through incremental builds and cached intermediates.
Scripting around Maven, or splitting modules / test execution / repositories, won't solve this problem. We should rather spend the effort on migrating to a modern build tool which will provide us benefits in the long run.
- Flink relies on a fairly slow build service (Travis CI), while simply throwing more money at the problem could cut the build time at least in half.
We should consider using a build service that provides bigger machines to solve our build time problem.
My opinion is based on many assumptions (Gradle is actually as fast as promised (I haven't used it before), we can build Flink with Gradle, we find sponsors for bigger build machines) that we need to test first through PoCs.
Best,
Robert
On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <
aljos...@apache.org>
wrote:
I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 minutes. After removing all mentions of maven-shade-plugin, the build time goes down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, because some expected stuff is not packaged, and most of the end-to-end tests use the shade plugin to package the jars for testing.)
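For anyone who wants to reproduce this without deleting plugin entries: one way is to override the inherited shade execution in a module and unbind it from its phase. A rough sketch only; the execution id "shade-flink" is an assumption and has to match whatever the parent pom actually defines:

<profile>
  <id>skip-shading</id> <!-- hypothetical profile name -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <!-- must match the execution id inherited from the parent pom -->
            <id>shade-flink</id>
            <!-- unbind the shade goal so no shading happens in this module -->
            <phase>none</phase>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>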
Aljoscha
On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org>
wrote:
Hi all,
I wanted to understand the impact of the hardware we are using for running our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1]. They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type. Running the same workload on a machine with 32 virtual cores and 64 GB of memory takes *1:21 h*.
What is interesting are the per-module build time differences. Modules which parallelize their tests well benefit greatly from the additional cores:
"flink-tests" 36:51 min vs 4:33 min
"flink-runtime" 23:41 min vs 3:47 min
"flink-table-planner" 15:54 min vs 3:13 min
On the other hand, we have modules which are not parallel at all:
"flink-connector-kafka": 16:32 min vs 15:19 min
"flink-connector-kafka-0.11": 9:52 min vs 7:46 min
Also, the checkstyle plugin is not scaling at all.
Chesnay reported some significant speedups by reusing forks.
I don't know how much effort it would be to make the Kafka tests parallelizable. In total, they currently use 30 minutes on the big machine (while 31 CPUs are idling :) ).
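To at least make use of those idle cores, the relevant surefire knobs would be along the lines of the sketch below; the values are placeholders, and whether the Kafka tests can actually tolerate parallel forks is exactly the open question:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- one test JVM per available core; "1C" scales with the machine -->
    <forkCount>1C</forkCount>
    <!-- keep forks alive across test classes to avoid repeated JVM startup -->
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>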
Let me know what you think about these results. If the community is generally interested in investigating further in that direction, I could look into software to orchestrate this, as well as sponsors for such an infrastructure.
[1] https://docs.travis-ci.com/user/reference/overview/
On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <
ches...@apache.org>
wrote:
@Aljoscha Shading takes a few minutes for a full build; you can see this quite easily by looking at the compile step in the misc profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that take longer than a fraction of a second usually do so because they shade lots of classes. Note that I cannot tell you how much of this is spent on relocations, and how much on writing the jar.
Personally, I'd very much like us to move all shading to flink-shaded; this would finally allow us to use newer Maven versions without needing cumbersome workarounds for flink-dist. However, this isn't a trivial affair in some cases; IIRC Calcite could be difficult to handle.
On another note, this would also simplify switching the main repo to another build system, since you would no longer have to deal with relocations, just packaging + merging NOTICE files.
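For readers less familiar with the setup, a relocation in the main repo looks roughly like the snippet below (placeholder package names, not an actual entry from our poms); these are the blocks that would move to flink-shaded, leaving only plain packaging and NOTICE merging behind:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- placeholder pattern; the real entries relocate our bundled dependencies -->
            <pattern>com.example.thirdparty</pattern>
            <shadedPattern>org.apache.flink.shaded.thirdparty</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>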
@BowenLi I disagree; flink-shaded does not include any tests, API compatibility checks, checkstyle, layered shading (e.g., flink-runtime and flink-dist, where both relocate dependencies and one is bundled by the other), and, most importantly, CI (and really, without CI being covered in a PoC there's nothing to discuss).
On 16/08/2019 15:13, Aljoscha Krettek wrote:
Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.
Aljoscha
On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
+1 to Till's points on #2 and #5, especially the potential non-disruptive, gradual migration approach if we decide to go that route.
To add on, I want to point out that we can actually start with the flink-shaded project [1], which is a perfect candidate for a PoC. It's much smaller, totally isolated from and not entangled with the flink project [2], and it actually covers most of our practical feature requirements for a build tool - all of which makes it an ideal testing ground.
[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink
On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <
trohrm...@apache.org>
wrote:
For the sake of keeping the discussion focused and not cluttering the discussion thread, I would suggest splitting the detailed reporting on reusing JVMs into a separate thread and cross-linking it from here.
Cheers,
Till
On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <
ches...@apache.org>
wrote:
Update:
TL;DR: table-planner is a good candidate for enabling fork reuse right away, while flink-tests has the potential for huge savings, but we have to figure out some issues first.
Build link:
https://travis-ci.org/zentol/flink/builds/572659220
4/8 profiles failed.
No speedup in libraries, python, blink_planner; 7 minutes saved in libraries (table-planner).
The kafka and connectors profiles both fail in kafka tests due to producer leaks, and no speedup could be confirmed so far:
java.lang.AssertionError: Detected producer leak. Thread name: kafka-producer-network-thread | producer-239
  at org.junit.Assert.fail(Assert.java:88)
  at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
  at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
The tests profile failed due to various errors in migration
tests:
junit.framework.AssertionFailedError: Did not see the expected accumulator results within time limit.
  at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
*However*, a normal tests run takes 40 minutes, while this one above failed after 19 minutes and is only missing the migration tests (which currently need 6-7 minutes). So we could save somewhere between 15 and 20 minutes here.
Finally, the misc profile fails in YARN:
java.lang.AssertionError
  at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.
On 16/08/2019 10:43, Chesnay Schepler wrote:
There appears to be general agreement that 1) should be looked into; I've set up a branch with fork reuse enabled for all tests and will report back the results.
On 15/08/2019 09:38, Chesnay Schepler wrote:
Hello everyone,
improving our build times is a hot topic at the moment, so let's discuss the different ways they could be reduced.
Current state:
First up, let's look at some numbers:
1 full build currently consumes 5h of build time total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may fluctuate of course depending on the current Travis load. This applies both to builds on the Apache and flink-ci Travis.
At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in); however, we are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
(Note: Recently, optimizations were added to the ci-bot where pending builds are canceled if a new commit was pushed to the PR or the PR was closed, which should prove especially useful during the rush hours we see before feature freezes.)
Past approaches
Over the years we have done rather few things to improve this situation (hence our current predicament).
Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
The rework that introduced stages also did not provide a speedup at the time of introduction, although this changed slightly once more profiles were added and some optimizations to the caching were made.
Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
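Roughly, that change amounts to something like the following in the module's surefire setup (a sketch only; the actual execution name and include pattern in flink-table-planner-blink may differ):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <executions>
    <execution>
      <!-- hypothetical execution name for the IT case run -->
      <id>integration-tests</id>
      <goals>
        <goal>test</goal>
      </goals>
      <configuration>
        <includes>
          <include>**/*ITCase.*</include>
        </includes>
        <!-- the actual change: run IT cases in a single, reused fork
             instead of spawning a fresh JVM for every test class -->
        <reuseForks>true</reuseForks>
        <forkCount>1</forkCount>
      </configuration>
    </execution>
  </executions>
</plugin>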
Suggestions
This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor may I agree with all of them).
1. Enable JVM reuse for IT cases in more modules.
   * We've seen significant speedups in the blink planner, and this should be applicable for all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated).
2. Custom differential build scripts
   * Set up custom scripts for determining which modules might be affected by a change, and manipulate the splits accordingly. This approach is conceptually quite straightforward, but has limits since it has to be pessimistic; i.e., a change in flink-core _must_ result in testing all modules.
3. Only run smoke tests when a PR is opened, run heavy tests on demand.
   * With the introduction of the ci-bot we now have significantly more options for how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc.) on demand.
4. Move more tests into cron builds
   * The budget version of 3): move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs.
5. Gradle
   * Gradle was brought up a few times for its built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts.
   * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc.).
   * This is the most disruptive change by a fair margin, as it would affect the entire project, developers, and potentially users (if they build from source).
6. CI service
   * Our current artifact caching setup on Travis is basically a hack; we're abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues, and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be.
   * There are CI services that provide build artifact caching out of the box, which could be useful for us.
   * To date, no PoC for using another CI service has been provided.