Re: [DISCUSS] Reducing build times

Chesnay Schepler Fri, 16 Aug 2019 01:44:51 -0700

There appears to be general agreement that 1) should be looked into; I've set up a branch with fork reuse enabled for all tests and will report back the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,
improving our build times is a hot topic at the moment, so let's discuss the different ways in which they could be reduced.
       Current state:

First up, let's look at some numbers:
1 full build currently consumes 5h of build time in total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may of course fluctuate depending on the current Travis load. This applies to builds on both the Apache and flink-ci Travis.
At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in); however, we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
(Note: Recently, optimizations were added to ci-bot so that pending builds are canceled if a new commit is pushed to the PR or the PR is closed, which should prove especially useful during the rush hours we see before feature freezes.)
       Past approaches
Over the years we have done rather few things to improve this situation (hence our current predicament).
Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
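To illustrate why further splitting stops helping, here is a rough sketch (not our actual Travis setup) that greedily distributes modules over profiles. The durations are made-up numbers and `split_into_profiles` is a hypothetical helper; the point is only that the slowest profile can never be faster than the single largest module.

```python
def split_into_profiles(durations, num_profiles):
    """Greedy split: place each module, longest first, into the least-loaded profile.

    `durations` maps module name -> test time in minutes.
    Returns (profiles, loads), where loads[i] is the total time of profiles[i].
    """
    profiles = [[] for _ in range(num_profiles)]
    loads = [0] * num_profiles
    by_duration = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for module, minutes in by_duration:
        target = loads.index(min(loads))  # index of the least-loaded profile
        profiles[target].append(module)
        loads[target] += minutes
    return profiles, loads


# ASSUMPTION: invented durations in minutes, purely for illustration.
DURATIONS = {
    "flink-tests": 45,
    "flink-runtime": 40,
    "flink-table-planner-blink": 42,
    "other-a": 20,
    "other-b": 15,
    "other-c": 10,
}


def run_time(durations, num_profiles):
    """Run time of the build is the load of the slowest profile."""
    _, loads = split_into_profiles(durations, num_profiles)
    return max(loads)
```

With these invented numbers, `run_time(DURATIONS, 4)` and `run_time(DURATIONS, 6)` are identical, since the largest module alone determines the slowest profile; only splitting or speeding up that module would reduce the run time further.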
The rework that introduced stages also did not provide a speedup at the time of its introduction, although this changed slightly once more profiles were added and some optimizations to the caching were made.
Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
       Suggestions
This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor do I necessarily agree with all of them).
1. Enable JVM reuse for IT cases in more modules.
     * We've seen significant speedups in the blink planner, and this
       should be applicable for all modules. However, I presume there's
       a reason why we disabled JVM reuse (information on this would be
       appreciated).
2. Custom differential build scripts
     * Set up custom scripts for determining which modules might be
       affected by a change, and manipulate the splits accordingly. This
       approach is conceptually quite straight-forward, but has limits
       since it has to be pessimistic; i.e. a change in flink-core
       _must_ result in testing all modules.
3. Only run smoke tests when PR is opened, run heavy tests on demand.
     * With the introduction of the ci-bot we now have significantly
       more options on how to handle PR builds. One option could be to
       only run basic tests when the PR is created (which may be only
       modified modules, or all unit tests, or another low-cost
       scheme), and then have a committer trigger other builds (full
       test run, e2e tests, etc...) on demand.
4. Move more tests into cron builds
     * The budget version of 3); move certain tests that are either
       expensive (like some runtime tests that take minutes) or in
       rarely modified modules (like gelly) into cron jobs.
5. Gradle
     * Gradle was brought up a few times for its built-in support for
       differential builds; basically providing 2) without the overhead
       of maintaining additional scripts.
     * To date, no PoC has been provided that shows it working in our CI
       environment (i.e., handling splits & caching etc).
     * This is the most disruptive change by a fair margin, as it would
       affect the entire project, developers and potentially users (if
       they build from source).
6. CI service
     * Our current artifact caching setup on Travis is basically a
       hack; we're abusing the Travis cache, which is meant
       for long-term caching, to ship build artifacts across jobs. It's
       brittle at times due to timing/visibility issues and on branches
       the cleanup processes can interfere with running builds. It is
       also not as effective as it could be.
     * There are CI services that provide build artifact caching out of
       the box, which could be useful for us.
     * To date, no PoC for using another CI service has been provided.
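To make suggestion 2) a bit more concrete, below is a minimal sketch of how a differential build script could determine the affected modules. The dependency graph is a hypothetical, heavily simplified one (a real script would have to derive it from the Maven POMs), and the names `module_of` and `affected_modules` are made up for this example. It shows the pessimistic behaviour described above: a change in flink-core selects every module, and a change outside any known module falls back to testing everything.

```python
from collections import deque

# ASSUMPTION: hypothetical, heavily simplified graph (module -> modules it depends on).
DEPENDENCIES = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-tests": ["flink-runtime"],
    "flink-gelly": ["flink-core"],
}


def module_of(path, modules):
    """Return the module whose directory contains `path`, or None if there is none."""
    best = None
    for module in modules:
        if path == module or path.startswith(module + "/"):
            # Prefer the longest match in case module directories are nested.
            if best is None or len(module) > len(best):
                best = module
    return best


def affected_modules(changed_files, dependencies):
    """Return the changed modules plus everything that transitively depends on them."""
    # Invert the graph: module -> modules that depend on it.
    dependents = {module: set() for module in dependencies}
    for module, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(module)

    affected = set()
    queue = deque()
    for path in changed_files:
        module = module_of(path, dependencies)
        if module is None:
            # Pessimistic fallback: unknown file (e.g. the root pom.xml), test everything.
            return set(dependencies)
        if module not in affected:
            affected.add(module)
            queue.append(module)

    # Breadth-first walk over the inverted graph.
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

For example, `affected_modules(["flink-gelly/src/Foo.java"], DEPENDENCIES)` selects only flink-gelly, whereas any change under flink-core selects all four modules, which is exactly where this approach stops saving time.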
