Thanks Chesnay for starting this discussion.

+1 for #1, it might be the easiest way to get a significant speedup. If the only reason is isolation, I think we can fix the static fields or global state used in Flink where possible.

+1 for #2, and thanks Aleksey for the prototype. I think it's a good approach which doesn't introduce too many things to maintain.

+1 for #3 (run CRON or e2e tests on demand). We need this when reviewing some pull requests, because we are not sure whether they will break some specific e2e test. Currently, we have to run it locally by building the whole project, or enable CRON jobs for the pushed branch in the contributor's own Travis.

Besides that, I think FLINK-11464 [1] is also a good way to cache distributions and save a lot of download time.

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-11464

On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <alek...@ververica.com> wrote:

> Hi all!
>
> Thanks for starting this discussion.
>
> I'd like to also add my 2 cents:
>
> +1 for #2, differential build scripts.
> I've worked on this approach, and with it I think it's possible to reduce the total build time with relatively low effort, without enforcing any new build tool and with low maintenance cost.
>
> You can check a proposed change (for the old CI setup, when Flink PRs were running in the Apache common CI pool) here: https://github.com/apache/flink/pull/9065
> In the proposed change, the dependency check is not heavily hardcoded and just uses Maven's results for the dependency graph analysis.
>
> > This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules.
>
> Agreed, in Flink's case there are some core modules that would trigger a whole test run with such an approach. For developers who modify such components, the build time would be the longest. But this approach should really help developers who touch more-or-less independent modules.
>
> Even for core modules, it's possible to create "abstraction" barriers by changing the dependency graph. For example, it could look like: flink-core-api <-- flink-core, flink-core-api <-- flink-connectors. In that case, only a change in flink-core-api would trigger a whole test run.
>
> +1 for #3, separating PR CI runs into different stages.
> Imo, it may require more changes to the current CI setup compared to #2, and it should not be done naively: best if it integrates with the Flink bot and triggers some follow-up build steps only when some prerequisites are done.
>
> +1 for #4, to move some tests into cron runs.
> But imo, this does not scale well; it applies only to a small subset of tests.
>
> +1 for #6, to use other CI service(s).
> More specifically, GitHub gives build actions for free that can be used to offload some build steps/PR checks. It can help to move some PR checks out of the main CI build (for example: documentation builds, license checks, code formatting checks).
>
> Regards,
> Aleksey
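
For illustration, a rough sketch of what such a differential script can boil down to. This is simplified and hypothetical: the actual module detection lives in the pull request linked above, and nested modules (e.g. under flink-table/) would need smarter path handling than a plain top-level directory match.

```
#!/usr/bin/env bash
# Sketch only; the real module detection is in https://github.com/apache/flink/pull/9065.
# Idea: derive the set of touched top-level modules from the diff, then let Maven's
# reactor expand that set to everything that depends on them.

# Top-level directories touched by the change that contain a pom.xml.
changed_modules=$(git diff --name-only origin/master...HEAD \
  | cut -d/ -f1 \
  | sort -u \
  | while read -r dir; do
      if [ -f "$dir/pom.xml" ]; then echo "$dir"; fi
    done \
  | paste -s -d, -)

if [ -z "$changed_modules" ]; then
  echo "No module changes detected, skipping tests."
  exit 0
fi

# -pl  : build only the changed modules
# -am  : also build their upstream dependencies (needed for compilation)
# -amd : also build and test every module that depends on them; this is the
#        pessimistic part: a change in flink-core still pulls in almost everything
mvn -B verify -pl "$changed_modules" -am -amd
```

The -amd flag is what makes the approach pessimistic but safe: a change in flink-core still fans out to essentially the whole project, while a change in an isolated connector only rebuilds that connector and its dependents.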

> On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Thanks for starting this discussion Chesnay. I think it has become obvious to the Flink community that with the existing build setup we cannot really deliver fast build times, which are essential for fast iteration cycles and high developer productivity. The reasons for this situation are manifold, but it is definitely affected by Flink's project growth, not always optimal tests, and the inflexibility that everything needs to be built. Hence, I consider the reduction of build times crucial for the project's health and future growth.
> >
> > Without necessarily voicing a strong preference for any of the presented suggestions, I wanted to comment on each of them:
> >
> > 1. This sounds promising. Could the reason why we don't reuse JVMs date back to the time when we still had a lot of static fields in Flink, which made it hard to reuse JVMs because of the potentially mutated global state?
> >
> > 2. Building hand-crafted solutions around a build system in order to compensate for limitations which other build systems support out of the box sounds like the not-invented-here syndrome to me. Reinventing the wheel has historically proven to usually not be the best solution, and it often comes with a high maintenance price tag. Moreover, it would add just another layer of complexity around our existing build system. I think the current state, where we have the Maven setup in pom files plus multiple bash scripts for Travis that specialize the builds to make them fit the time limit, is already not very transparent/easy to understand.
> >
> > 3. I could see this working, but it also requires a very good understanding of Flink from every committer, because the committer needs to know which tests would be good to run additionally.
> >
> > 4. I would be against this option solely to decrease our build time. My observation is that the community does not monitor the health of the cron jobs well enough. In the past the cron jobs have been unstable for as long as a complete release cycle. Moreover, I've seen PRs merged which passed Travis but broke the cron jobs. Consequently, I fear that this option would deteriorate Flink's stability.
> >
> > 5. I would rephrase this point into changing the build system. Gradle could be one candidate, but there are also other build systems out there like Bazel. Changing the build system would indeed be a major endeavour, but I could see the long-term benefits of such a change (similar to having a consistent and enforced code style), in particular if the build system supports the functionality which we would otherwise build & maintain on our own. I think there would be ways to make the transition not as disruptive as described. For example, one could keep the Maven build and the new build side by side until one is confident enough that the new build produces the same output as the Maven build. Maybe it would also be possible to migrate individual modules starting from the leaves. However, I admit that changing the build system will affect every Flink developer because they need to learn & understand it.
> >
> > 6. I would like to learn about other people's experience with different CI systems. Travis has worked ok-ish for Flink so far, but we sometimes see problems with its caching mechanism, as Chesnay stated. I think that this topic is actually orthogonal to the other suggestions.
> >
> > My gut feeling is that no single suggestion will be our solution, but rather a combination of them.
> >
> > Cheers,
> > Till
> >
> > On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <reed...@gmail.com> wrote:
> >
> > > Thanks Chesnay for bringing up this discussion and sharing those thoughts to speed up the building process.
> > >
> > > I'd +1 for options 2 and 3.
> > >
> > > We can benefit a lot from option 2. Developing table, connectors, libraries, or docs modules would result in much fewer tests (1/3 to 1/tens) to run.
> > > PRs for those modules take up more than half of all the PRs in my observation.
> > >
> > > Option 3 can be supplementary to option 2 for cases where the PR is modifying fundamental modules like flink-core or flink-runtime.
> > > It could even be a switch for the test scope (basic/full) of a PR, so that committers do not need to trigger it multiple times.
> > > With it we can postpone the testing of IT cases or connectors until the PR reaches a stable state.
> > >
> > > Thanks,
> > > Zhu Zhu
> > >
> > > Chesnay Schepler <ches...@apache.org> wrote on Thu, Aug 15, 2019 at 3:38 PM:
> > >
> > > > Hello everyone,
> > > >
> > > > improving our build times is a hot topic at the moment, so let's discuss the different ways they could be reduced.
> > > >
> > > > Current state:
> > > >
> > > > First up, let's look at some numbers:
> > > >
> > > > One full build currently consumes 5h of build time in total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may of course fluctuate depending on the current Travis load. This applies to builds on both the Apache and flink-ci Travis.
> > > >
> > > > At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in); however, we are in an admittedly quiet period right now.
> > > > Two weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
> > > >
> > > > (Note: Recently optimizations were added to the ci-bot so that pending builds are canceled if a new commit is pushed to the PR or the PR is closed, which should prove especially useful during the rush hours we see before feature freezes.)
> > > >
> > > > Past approaches
> > > >
> > > > Over the years we have done rather few things to improve this situation (hence our current predicament).
> > > >
> > > > Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
> > > >
> > > > The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
> > > >
> > > > The rework that introduced stages did not, at the time of its introduction, provide a speedup either, although this changed slightly once more profiles were added and some optimizations to the caching were made.
> > > >
> > > > Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
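
A side note on the fork-reuse change above: the knobs involved are surefire's standard forkCount/reuseForks parameters. The invocation below is only a sketch for trying the same settings on another module; it assumes the module's pom does not pin these parameters explicitly (explicit pom configuration would win over the -D overrides), and the module path simply reflects the current flink-table layout.

```
# reuseForks=true keeps a forked test JVM alive across test classes instead of
# spawning a fresh JVM per class; forkCount controls how many forks run in parallel.
mvn -B verify -pl flink-table/flink-table-planner-blink \
    -DforkCount=1 -DreuseForks=true
```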

> > > > Suggestions
> > > >
> > > > This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor may I agree with all of them).
> > > >
> > > > 1. Enable JVM reuse for IT cases in more modules.
> > > >     * We've seen significant speedups in the blink planner, and this should be applicable for all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated).
> > > > 2. Custom differential build scripts
> > > >     * Set up custom scripts for determining which modules might be affected by a change, and manipulate the splits accordingly. This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules.
> > > > 3. Only run smoke tests when a PR is opened, run heavy tests on demand.
> > > >     * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc...) on demand.
> > > > 4. Move more tests into cron builds
> > > >     * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs.
> > > > 5. Gradle
> > > >     * Gradle was brought up a few times for its built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts.
> > > >     * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc.).
> > > >     * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (if they build from source).
> > > > 6. CI service
> > > >     * Our current artifact caching setup on Travis is basically a hack; we're abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues, and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be.
> > > >     * There are CI services that provide build artifact caching out of the box, which could be useful for us.
> > > >     * To date, no PoC for using another CI service has been provided.
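
To make suggestion 3 slightly more concrete: with the ci-bot in place, the triggered build could reduce to a simple scope switch along the following lines. This is a hypothetical sketch; the TEST_SCOPE variable and the goal mapping are made up, and a real implementation would likely reuse the existing stage/splitting scripts.

```
#!/usr/bin/env bash
# Hypothetical: TEST_SCOPE would be set by the ci-bot, e.g. "basic" when the PR is
# opened and "full" when a committer explicitly requests the heavy run.
TEST_SCOPE="${TEST_SCOPE:-basic}"

case "$TEST_SCOPE" in
  basic)
    # Cheap smoke run: compile everything, execute unit tests only.
    mvn -B clean test
    ;;
  full)
    # Full run including IT cases, i.e. roughly what every commit triggers today.
    mvn -B clean verify
    ;;
  *)
    echo "Unknown TEST_SCOPE: $TEST_SCOPE" >&2
    exit 1
    ;;
esac
```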