Yes, we can ensure the same (or better) experience for contributors. On the powerful machines, builds finish in 1.5 hours (without any caching enabled).
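(For reference, these timings are for plain, uncached full builds including tests - roughly the invocation below; the exact flags used for the individual runs are not recorded in this thread, so treat it as an illustration only.)

    # Illustrative only: time a plain full build including tests; the precise
    # flags behind the quoted numbers may differ.
    time mvn clean verify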
Azure Pipelines offers open source projects 10 concurrent builds with a 6-hour timeout per build. Flink needs 3.5 hours on that infrastructure (not parallelized at all, no caching). These free machines are very similar to those of Travis, so I expect no build time regressions if we set it up similarly.

On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <ches...@apache.org> wrote:

> Will using more powerful machines for the project make it more difficult to ensure that contributor builds still run in a reasonable time?
>
> As an example of this happening on Travis, contributors currently cannot run all e2e tests since they time out, but on apache we have a larger timeout.
>
> On 03/09/2019 18:57, Robert Metzger wrote:
> > Hi all,
> >
> > I wanted to give a short update on this:
> > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently working on making all modules compile and test with Gradle. We've also identified some problematic areas (shading being the most obvious one) which we will analyse as part of the PoC. The goal is to see how much Gradle helps to parallelise our build, and to avoid duplicate work (incremental builds).
> > - I am working on setting up a Flink testing infrastructure based on Azure Pipelines, using more powerful hardware. Alibaba kindly provided me with two 32-core machines (temporarily), and another company reached out to me privately, looking into options for cheap, fast machines :)
> > If nobody in the community disagrees, I am going to set up Azure Pipelines with our apache/flink GitHub repository as a build infrastructure that exists next to Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is equally or even more reliable than Travis, and I want to see what the required maintenance work is.
> > On top of that, Azure Pipelines is a very feature-rich tool with a lot of nice options for us to improve the build experience (statistics about tests (flaky tests etc.), nice docker support, plenty of free build resources for open source projects, ...)
> >
> > Best,
> > Robert
> >
> > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <rmetz...@apache.org> wrote:
> >
> >> Hi all,
> >>
> >> I have summarized all arguments mentioned so far + some additional research into a Wiki page here:
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >>
> >> I'm happy to hear further comments on my summary! I'm pretty sure we can find more pros and cons for the different options.
> >>
> >> My opinion after looking at the options:
> >>
> >> - Flink relies on an outdated build tool (Maven), while a good alternative is well-established (gradle) and will likely provide a much better CI and local build experience through incremental builds and cached intermediates. Scripting around Maven, or splitting modules / test execution / repositories, won't solve this problem. We should rather spend the effort on migrating to a modern build tool which will provide us benefits in the long run.
> >> - Flink relies on a fairly slow build service (Travis CI), while simply putting more money into the problem could cut the build time at least in half. We should consider using a build service that provides bigger machines to solve our build time problem.
> >>
> >> My opinion is based on many assumptions (that gradle is actually as fast as promised (I haven't used it before), that we can build Flink with gradle, that we find sponsors for bigger build machines) that we need to test first through PoCs.
> >>
> >> Best,
> >> Robert
> >>
> >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org> wrote:
> >>
> >>> I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 minutes. After removing all mentions of maven-shade-plugin the build time goes down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, because some expected stuff is not packaged, and most of the end-to-end tests use the shade plugin to package the jars for testing.)
> >>>
> >>> Aljoscha
> >>>
> >>>> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I wanted to understand the impact of the hardware we are using for running our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1]. They are using Google Cloud Compute Engine *n1-standard-2* instances. Running a full "mvn clean verify" takes *03:32 h* on such a machine type. Running the same workload on a machine with 32 virtual cores and 64 GB of memory takes *1:21 h*.
> >>>>
> >>>> What is interesting are the per-module build time differences. Modules which parallelize tests well benefit greatly from the additional cores:
> >>>> "flink-tests" 36:51 min vs 4:33 min
> >>>> "flink-runtime" 23:41 min vs 3:47 min
> >>>> "flink-table-planner" 15:54 min vs 3:13 min
> >>>>
> >>>> On the other hand, we have modules which are not parallel at all:
> >>>> "flink-connector-kafka": 16:32 min vs 15:19 min
> >>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> >>>> Also, the checkstyle plugin is not scaling at all.
> >>>>
> >>>> Chesnay reported some significant speedups by reusing forks. I don't know how much effort it would be to make the Kafka tests parallelizable. In total, they currently use 30 minutes on the big machine (while 31 CPUs are idling :) )
> >>>>
> >>>> Let me know what you think about these results. If the community is generally interested in investigating that direction further, I could look into software to orchestrate this, as well as sponsors for such an infrastructure.
> >>>>
> >>>> [1] https://docs.travis-ci.com/user/reference/overview/
> >>>>
> >>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>>>> @Aljoscha Shading takes a few minutes for a full build; you can see this quite easily by looking at the compile step in the misc profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; any module that takes longer than a fraction of a second there usually does so because it shades lots of classes. Note that I cannot tell you how much of this is spent on relocations, and how much on writing the jar.
> >>>>>
> >>>>> Personally, I'd very much like us to move all shading to flink-shaded; this would finally allow us to use newer maven versions without needing cumbersome workarounds for flink-dist. However, this isn't a trivial affair in some cases; IIRC calcite could be difficult to handle.
> >>>>>
> >>>>> On another note, this would also simplify switching the main repo to another build system, since you would no longer have to deal with relocations, just packaging + merging NOTICE files.
> >>>>>
> >>>>> @BowenLi I disagree; flink-shaded does not include any tests, API compatibility checks, checkstyle, layered shading (e.g., flink-runtime and flink-dist, where both relocate dependencies and one is bundled by the other), and, most importantly, CI (and really, without CI being covered in a PoC there's nothing to discuss).
> >>>>>
> >>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>>>>> Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.
> >>>>>>
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> +1 to Till's points on #2 and #5, especially the potential non-disruptive, gradual migration approach if we decide to go that route.
> >>>>>>>
> >>>>>>> To add on, I want to point out that we can actually start with the flink-shaded project [1], which is a perfect candidate for a PoC. It's of much smaller size, totally isolated from and not interfering with the flink project [2], and it actually covers most of our practical feature requirements for a build tool - all of which makes it an ideal experimental field.
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/flink-shaded
> >>>>>>> [2] https://github.com/apache/flink
> >>>>>>>
> >>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>>>>>> For the sake of keeping the discussion focused and not cluttering the discussion thread, I would suggest splitting the detailed reporting for reusing JVMs into a separate thread and cross-linking it from here.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>> Update:
> >>>>>>>>>
> >>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right away, while flink-tests has the potential for huge savings, but we have to figure out some issues first.
> >>>>>>>>>
> >>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>>>>>>
> >>>>>>>>> 4/8 profiles failed.
> >>>>>>>>>
> >>>>>>>>> No speedup in libraries, python, blink_planner; 7 minutes saved in libraries (table-planner).
> >>>>>>>>>
> >>>>>>>>> The kafka and connectors profiles both fail in kafka tests due to producer leaks, and no speedup could be confirmed so far:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name: kafka-producer-network-thread | producer-239
> >>>>>>>>>     at org.junit.Assert.fail(Assert.java:88)
> >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >>>>>>>>>
> >>>>>>>>> The tests profile failed due to various errors in migration tests:
> >>>>>>>>>
> >>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected accumulator results within time limit.
> >>>>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >>>>>>>>>
> >>>>>>>>> *However*, a normal tests run takes 40 minutes, while the run above failed after 19 minutes and is only missing the migration tests (which currently need 6-7 minutes). So we could save somewhere between 15 and 20 minutes here.
> >>>>>>>>>
> >>>>>>>>> Finally, the misc profile fails in YARN:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError
> >>>>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >>>>>>>>>
> >>>>>>>>> No significant speedup could be observed in other modules; for flink-yarn-tests we can maybe get a minute or 2 out of it.
> >>>>>>>>>
> >>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> >>>>>>>>>> There appears to be general agreement that 1) should be looked into; I've set up a branch with fork reuse enabled for all tests and will report back the results.
> >>>>>>>>>>
> >>>>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
> >>>>>>>>>>> Hello everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> improving our build times is a hot topic at the moment, so let's discuss the different ways they could be reduced.
> >>>>>>>>>>>
> >>>>>>>>>>> Current state:
> >>>>>>>>>>>
> >>>>>>>>>>> First up, let's look at some numbers:
> >>>>>>>>>>>
> >>>>>>>>>>> 1 full build currently consumes 5h of build time in total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may of course fluctuate depending on the current Travis load. This applies to builds on both the Apache and flink-ci Travis.
> >>>>>>>>>>>
> >>>>>>>>>>> At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in), however we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
> >>>>>>>>>>>
> >>>>>>>>>>> (Note: Recently, optimizations were added to the ci-bot so that pending builds are canceled if a new commit is pushed to the PR or the PR is closed, which should prove especially useful during the rush hours we see before feature freezes.)
> >>>>>>>>>>>
> >>>>>>>>>>> Past approaches
> >>>>>>>>>>>
> >>>>>>>>>>> Over the years we have done rather few things to improve this situation (hence our current predicament).
> >>>>>>>>>>>
> >>>>>>>>>>> Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
> >>>>>>>>>>>
> >>>>>>>>>>> The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
> >>>>>>>>>>>
> >>>>>>>>>>> The rework that introduced stages did not provide a speedup at the time of its introduction either, although this changed slightly once more profiles were added and some optimizations were made to the caching.
> >>>>>>>>>>>
> >>>>>>>>>>> Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
> >>>>>>>>>>>
> >>>>>>>>>>> Suggestions
> >>>>>>>>>>>
> >>>>>>>>>>> This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor may I agree with all of them).
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
> >>>>>>>>>>>    * We've seen significant speedups in the blink planner, and this should be applicable for all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated).
> >>>>>>>>>>> 2. Custom differential build scripts
> >>>>>>>>>>>    * Set up custom scripts for determining which modules might be affected by a change, and manipulate the splits accordingly. This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules.
> >>>>>>>>>>> 3. Only run smoke tests when a PR is opened, run heavy tests on demand.
> >>>>>>>>>>>    * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc...) on demand.
> >>>>>>>>>>> 4. Move more tests into cron builds
> >>>>>>>>>>>    * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs.
> >>>>>>>>>>> 5. Gradle
> >>>>>>>>>>>    * Gradle was brought up a few times for its built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts.
> >>>>>>>>>>>    * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc).
> >>>>>>>>>>>    * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (if they build from source).
> >>>>>>>>>>> 6. CI service
> >>>>>>>>>>>    * Our current artifact caching setup on Travis is basically a hack; we're abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues, and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be.
> >>>>>>>>>>>    * There are CI services that provide build artifact caching out of the box, which could be useful for us.
> >>>>>>>>>>>    * To date, no PoC for using another CI service has been provided.
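
A small addition regarding 2) above (custom differential build scripts): a first version would not need to be fancy. The following is only a rough, untested sketch (GNU/Linux shell assumed, module detection simplified); thanks to Maven's -amd flag it stays pessimistic in the sense described above, because a change in flink-core pulls in every module that depends on it:

    #!/usr/bin/env bash
    # Sketch: build only the Maven modules touched by a change, plus everything
    # that depends on them. Changes that cannot be mapped to a module (root pom,
    # tooling, docs, ...) fall back to a full build.
    set -eu

    BASE="${1:-origin/master}"

    # Map each changed file to the nearest enclosing directory with a pom.xml.
    modules=$(git diff --name-only "$BASE"...HEAD \
      | while read -r f; do
          d=$(dirname "$f")
          while [ "$d" != "." ] && [ ! -f "$d/pom.xml" ]; do d=$(dirname "$d"); done
          echo "$d"
        done \
      | sort -u)

    if [ -z "$modules" ] || printf '%s\n' "$modules" | grep -qxF '.'; then
      # Nothing detected, or a change at the repository root: be pessimistic.
      mvn clean verify
    else
      # -pl limits the reactor to the changed modules; -amd ("also make
      # dependents") adds every module that depends on them.
      mvn clean verify -pl "$(printf '%s\n' "$modules" | paste -sd, -)" -amd
    fi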