I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 minutes. After removing all mentions of maven-shade-plugin, the build time goes down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, because some expected classes are not packaged, and most of the end-to-end tests use the shade plugin to package the jars for testing.)
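
For reference, deleting the plugin sections is a bit drastic; a less invasive way to get roughly the same measurement (a sketch, untested) is to unbind the shade executions from the build, e.g.:

<!-- Override the existing shade execution and detach it from the package
     phase so that no shading/jar rewriting happens. The execution id
     "shade-flink" is only a placeholder; it has to match the id that the
     module's pom actually declares. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <id>shade-flink</id>
      <phase>none</phase>
    </execution>
  </executions>
</plugin>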

Aljoscha

> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org> wrote:
> 
> Hi all,
> 
> I wanted to understand the impact of the hardware we are using for running our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1]. They are using Google Cloud Compute Engine *n1-standard-2* instances.
> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> 
> Running the same workload on a 32 virtual core, 64 GB machine takes *1:21 h*.
> 
> What is interesting are the per-module build time differences.
> Modules which parallelize tests well benefit greatly from the additional cores:
> "flink-tests" 36:51 min vs 4:33 min
> "flink-runtime" 23:41 min vs 3:47 min
> "flink-table-planner" 15:54 min vs 3:13 min
> 
> On the other hand, we have modules which are not parallel at all:
> "flink-connector-kafka": 16:32 min vs 15:19 min
> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> Also, the checkstyle plugin is not scaling at all.
> 
> Chesnay reported some significant speedups by reusing forks.
> I don't know how much effort it would be to make the Kafka tests parallelizable. In total, they currently use 30 minutes on the big machine (while 31 CPUs are idling :) )
> 
> Let me know what you think about these results. If the community is generally interested in investigating further in that direction, I could look into software to orchestrate this, as well as sponsors for such an infrastructure.
> 
> [1] https://docs.travis-ci.com/user/reference/overview/
> 
> 
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org> wrote:
> 
>> @Aljoscha Shading takes a few minutes for a full build; you can see this quite easily by looking at the compile step in the misc profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that take longer than a fraction of a second usually do so because they shade lots of classes. Note that I cannot tell you how much of this is spent on relocations, and how much on writing the jar.
>> 
>> Personally, I'd very much like us to move all shading to flink-shaded; this would finally allow us to use newer Maven versions without needing cumbersome workarounds for flink-dist. However, this isn't a trivial affair in some cases; IIRC calcite could be difficult to handle.
>> 
>> On another note, this would also simplify switching the main repo to another build system, since you would no longer have to deal with relocations, just packaging + merging NOTICE files.
>> 
>> @BowenLi I disagree; flink-shaded does not include any tests, API compatibility checks, checkstyle, layered shading (e.g., flink-runtime and flink-dist, where both relocate dependencies and one is bundled by the other), and, most importantly, CI (and really, without CI being covered in a PoC there's nothing to discuss).
>> 
>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>> Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.
>>> 
>>> Aljoscha
>>> 
>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
>>>> 
>>>> +1 to Till's points on #2 and #5, especially the potential non-disruptive, gradual migration approach if we decide to go that route.
>>>> 
>>>> To add on, I want to point out that we can actually start with the flink-shaded project [1], which is a perfect candidate for a PoC.
>>>> It's of a much smaller size, totally isolated from and not interfering with the flink project [2], and it actually covers most of our practical feature requirements for a build tool - all making it an ideal experimental field.
>>>> 
>>>> [1] https://github.com/apache/flink-shaded
>>>> [2] https://github.com/apache/flink
>>>> 
>>>> 
>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> 
>>>>> For the sake of keeping the discussion focused and not cluttering the discussion thread, I would suggest splitting the detailed reporting on reusing JVMs into a separate thread and cross-linking it from here.
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>> 
>>>>>> Update:
>>>>>> 
>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right away, while flink-tests has the potential for huge savings, but we have to figure out some issues first.
>>>>>> 
>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>>> 
>>>>>> 4/8 profiles failed.
>>>>>> 
>>>>>> No speedup in libraries, python, blink_planner, 7 minutes saved in libraries (table-planner).
>>>>>> 
>>>>>> The kafka and connectors profiles both fail in kafka tests due to producer leaks, and no speedup could be confirmed so far:
>>>>>> 
>>>>>> java.lang.AssertionError: Detected producer leak. Thread name: kafka-producer-network-thread | producer-239
>>>>>>     at org.junit.Assert.fail(Assert.java:88)
>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>>>> 
>>>>>> The tests profile failed due to various errors in migration tests:
>>>>>> 
>>>>>> junit.framework.AssertionFailedError: Did not see the expected accumulator results within time limit.
>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>>>> 
>>>>>> *However*, a normal tests run takes 40 minutes, while this one above failed after 19 minutes and is only missing the migration tests (which currently need 6-7 minutes). So we could save somewhere between 15 and 20 minutes here.
>>>>>> 
>>>>>> Finally, the misc profile fails in YARN:
>>>>>> 
>>>>>> java.lang.AssertionError
>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>>>> 
>>>>>> No significant speedup could be observed in other modules; for flink-yarn-tests we can maybe get a minute or two out of it.
>>>>>> 
>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>>>>> There appears to be general agreement that 1) should be looked into; I've set up a branch with fork reuse enabled for all tests; will report back the results.
>>>>>>> 
>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>>>>> Hello everyone,
>>>>>>>> 
>>>>>>>> improving our build times is a hot topic at the moment, so let's discuss the different ways they could be reduced.
>>>>>>>> 
>>>>>>>> Current state:
>>>>>>>> 
>>>>>>>> First up, let's look at some numbers:
>>>>>>>> 
>>>>>>>> 1 full build currently consumes 5h of build time total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may of course fluctuate depending on the current Travis load. This applies both to builds on the Apache and flink-ci Travis.
>>>>>>>> 
>>>>>>>> At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in), however we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
>>>>>>>> 
>>>>>>>> (Note: Recently, optimizations were added to ci-bot where pending builds are canceled if a new commit was pushed to the PR or the PR was closed, which should prove especially useful during the rush hours we see before feature freezes.)
>>>>>>>> 
>>>>>>>> Past approaches
>>>>>>>> 
>>>>>>>> Over the years we have done rather few things to improve this situation (hence our current predicament).
>>>>>>>> 
>>>>>>>> Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
>>>>>>>> 
>>>>>>>> The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
>>>>>>>> 
>>>>>>>> The rework that introduced stages did not provide a speedup at the time of its introduction either, although this changed slightly once more profiles were added and some optimizations to the caching were made.
>>>>>>>> 
>>>>>>>> Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
>>>>>>>> 
>>>>>>>> Suggestions
>>>>>>>> 
>>>>>>>> This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor may I agree with all of them).
>>>>>>>> 
>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>>>>    * We've seen significant speedups in the blink planner, and this should be applicable to all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated). A rough surefire sketch is at the end of this mail.
>>>>>>>> 2. Custom differential build scripts
>>>>>>>>    * Set up custom scripts for determining which modules might be affected by a change, and manipulate the splits accordingly.
>>>>>>>>      This approach is conceptually quite straightforward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules.
>>>>>>>> 3. Only run smoke tests when a PR is opened, run heavy tests on demand.
>>>>>>>>    * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc.) on demand.
>>>>>>>> 4. Move more tests into cron builds
>>>>>>>>    * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs.
>>>>>>>> 5. Gradle
>>>>>>>>    * Gradle was brought up a few times for its built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts.
>>>>>>>>    * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc.).
>>>>>>>>    * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (if they build from source).
>>>>>>>> 6. CI service
>>>>>>>>    * Our current artifact caching setup on Travis is basically a hack; we're abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues, and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be.
>>>>>>>>    * There are CI services that provide build artifact caching out of the box, which could be useful for us.
>>>>>>>>    * To date, no PoC for using another CI service has been provided.
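>>>>>>>> 
>>>>>>>> Regarding 1), the blink planner change boils down to a surefire-plugin setting. A minimal sketch of the idea (the values below are illustrative only; the actual configuration in the poms, e.g. the separate execution for IT cases, is more involved):
>>>>>>>> 
>>>>>>>> <plugin>
>>>>>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>>>>>   <artifactId>maven-surefire-plugin</artifactId>
>>>>>>>>   <configuration>
>>>>>>>>     <!-- spawn a single forked JVM and reuse it for all test classes,
>>>>>>>>          instead of paying the JVM startup cost once per class -->
>>>>>>>>     <forkCount>1</forkCount>
>>>>>>>>     <reuseForks>true</reuseForks>
>>>>>>>>   </configuration>
>>>>>>>> </plugin>
>>>>>>>> 
>>>>>>>> Whether this is safe depends on how well the tests in a module isolate their static state, which is presumably why reuse was disabled in the first place.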