Thanks Chesnay for bringing up this discussion and sharing these thoughts on speeding up the build process.
I'd +1 for options 2 and 3.

We can benefit a lot from option 2. PRs that only touch the table, connector, library, or docs modules would need to run far fewer tests (from about a third of the suite down to a tiny fraction of it), and in my observation such PRs make up more than half of all PRs.

Option 3 can complement option 2 for the cases where a PR modifies fundamental modules like flink-core or flink-runtime. It could even be a switch for the test scope (basic/full) of a PR, so that committers do not need to trigger the tests multiple times. With it we can postpone running IT cases or connector tests until the PR has reached a stable state. (I have appended two rough sketches for 1) and 2) below the quoted mail.)

Thanks,
Zhu Zhu

Chesnay Schepler <ches...@apache.org> wrote on Thu, Aug 15, 2019, 3:38 PM:
> Hello everyone,
>
> improving our build times is a hot topic at the moment, so let's discuss
> the different ways they could be reduced.
>
>
> Current state:
>
> First up, let's look at some numbers:
>
> 1 full build currently consumes 5h of build time total ("total time"),
> and in the ideal case takes about 1h20m ("run time") to complete from
> start to finish. The run time may fluctuate of course depending on the
> current Travis load. This applies both to builds on the Apache and
> flink-ci Travis.
>
> At the time of writing, the current queue time for PR jobs (reminder:
> running on flink-ci) is about 30 minutes (which basically means that we
> are processing builds at the rate that they come in), however we are in
> an admittedly quiet period right now.
> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> everyone was scrambling to get their changes merged in time for the
> feature freeze.
>
> (Note: Recently optimizations were added to ci-bot where pending builds
> are canceled if a new commit was pushed to the PR or the PR was closed,
> which should prove especially useful during the rush hours we see before
> feature-freezes.)
>
>
> Past approaches
>
> Over the years we have done rather few things to improve this situation
> (hence our current predicament).
>
> Beyond the sporadic speedup of some tests, the only notable reduction in
> total build times was the introduction of cron jobs, which consolidated
> the per-commit matrix from 4 configurations (different scala/hadoop
> versions) to 1.
>
> The separation into multiple build profiles was only a work-around for
> the 50m limit on Travis. Running tests in parallel has the obvious
> potential of reducing run time, but we're currently hitting a hard limit
> since a few modules (flink-tests, flink-runtime,
> flink-table-planner-blink) are so loaded with tests that they nearly
> consume an entire profile by themselves (and thus no further splitting
> is possible).
>
> The rework that introduced stages did, at the time of introduction, also
> not provide a speedup, although this changed slightly once more
> profiles were added and some optimizations to the caching were made.
>
> Very recently we modified the surefire-plugin configuration for
> flink-table-planner-blink to reuse JVM forks for IT cases, providing a
> significant speedup (18 minutes!). So far we have not seen any negative
> consequences.
>
>
> Suggestions
>
> This is a list of /all/ suggestions for reducing run/total times that I
> have seen recently (in other words, they aren't necessarily mine, nor do
> I necessarily agree with all of them).
>
> 1. Enable JVM reuse for IT cases in more modules.
>     * We've seen significant speedups in the blink planner, and this
>       should be applicable for all modules.
>       However, I presume there's a reason why we disabled JVM reuse
>       (information on this would be appreciated).
> 2. Custom differential build scripts
>     * Set up custom scripts for determining which modules might be
>       affected by a change, and manipulate the splits accordingly. This
>       approach is conceptually quite straight-forward, but has limits
>       since it has to be pessimistic; i.e. a change in flink-core
>       _must_ result in testing all modules.
> 3. Only run smoke tests when a PR is opened, run heavy tests on demand.
>     * With the introduction of the ci-bot we now have significantly
>       more options on how to handle PR builds. One option could be to
>       only run basic tests when the PR is created (which may be only
>       modified modules, or all unit tests, or another low-cost
>       scheme), and then have a committer trigger other builds (full
>       test run, e2e tests, etc.) on demand.
> 4. Move more tests into cron builds
>     * The budget version of 3); move certain tests that are either
>       expensive (like some runtime tests that take minutes) or in
>       rarely modified modules (like gelly) into cron jobs.
> 5. Gradle
>     * Gradle was brought up a few times for its built-in support for
>       differential builds; basically providing 2) without the overhead
>       of maintaining additional scripts.
>     * To date no PoC was provided that shows it working in our CI
>       environment (i.e., handling splits & caching etc.).
>     * This is the most disruptive change by a fair margin, as it would
>       affect the entire project, developers and potentially users (if
>       they build from source).
> 6. CI service
>     * Our current artifact caching setup on Travis is basically a
>       hack; we're abusing the Travis cache, which is meant for
>       long-term caching, to ship build artifacts across jobs. It's
>       brittle at times due to timing/visibility issues, and on branches
>       the cleanup processes can interfere with running builds. It is
>       also not as effective as it could be.
>     * There are CI services that provide build artifact caching out of
>       the box, which could be useful for us.
>     * To date, no PoC for using another CI service has been provided.
>
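A small addendum regarding 1), for modules that want to try the same fork
reuse: as far as I understand, the change boils down to a surefire
configuration roughly like the sketch below (the values are illustrative;
the exact settings used for flink-table-planner-blink may differ).

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- fork a single JVM and keep reusing it for the IT cases,
             instead of spawning a fresh JVM per test class -->
        <forkCount>1</forkCount>
        <reuseForks>true</reuseForks>
      </configuration>
    </plugin>

If we roll this out to more modules we should watch out for tests that rely
on a fresh JVM (static state, class loading), which is presumably the reason
fork reuse was disabled in the first place.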
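To make 2) a bit more concrete, here is a minimal sketch of the kind of
script I have in mind; the git invocation and the module-to-scope mapping
are only assumptions for illustration, not a concrete proposal.

    #!/usr/bin/env bash
    # Sketch: pick a test scope ("basic" or "full") from the files a PR touches.
    # It has to stay pessimistic: any file outside the "cheap" modules
    # (in particular anything in flink-core or flink-runtime) falls back
    # to the full test suite.

    CHANGED_FILES=$(git diff --name-only origin/master...HEAD)

    # Directories whose changes only require the reduced test set.
    CHEAP_PATTERN='^(flink-table|flink-connectors|flink-libraries|docs)/'

    if echo "$CHANGED_FILES" | grep -qvE "$CHEAP_PATTERN"; then
      echo "full"    # at least one changed file is outside the cheap modules
    else
      echo "basic"   # only table/connectors/libraries/docs were touched
    fi

The output would then be used to select the Travis splits/profiles to run,
and could double as the basic/full switch I mentioned above for 3).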