I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 minutes. After removing all mentions of maven-shade-plugin, the build time goes down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, because some expected classes are not packaged, and most of the end-to-end tests use the shade plugin to package the jars for testing.)
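
For reference, deleting the plugin sections is a bit drastic; a less invasive way to get roughly the same measurement (a sketch, untested) is to unbind the shade executions from the build, e.g.:

<!-- Override the existing shade execution and detach it from the package
     phase so that no shading/jar rewriting happens. The execution id
     "shade-flink" is only a placeholder; it has to match the id that the
     module's pom actually declares. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <id>shade-flink</id>
      <phase>none</phase>
    </execution>
  </executions>
</plugin>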

Aljoscha

> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org> wrote:
> 
> Hi all,
> 
> I wanted to understand the impact of the hardware we are using for running our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1]. They are using Google Cloud Compute Engine *n1-standard-2* instances.
> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> 
> Running the same workload on a 32 virtual core, 64 GB machine takes *1:21 h*.
> 
> What is interesting are the per-module build time differences.
> Modules which parallelize tests well benefit greatly from the additional cores:
> "flink-tests" 36:51 min vs 4:33 min
> "flink-runtime" 23:41 min vs 3:47 min
> "flink-table-planner" 15:54 min vs 3:13 min
> 
> On the other hand, we have modules which are not parallel at all:
> "flink-connector-kafka": 16:32 min vs 15:19 min
> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> Also, the checkstyle plugin is not scaling at all.
> 
> Chesnay reported some significant speedups by reusing forks.
> I don't know how much effort it would be to make the Kafka tests parallelizable. In total, they currently use 30 minutes on the big machine (while 31 CPUs are idling :) )
> 
> Let me know what you think about these results. If the community is generally interested in investigating further in that direction, I could look into software to orchestrate this, as well as sponsors for such an infrastructure.
> 
> [1] https://docs.travis-ci.com/user/reference/overview/
> 
> 
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org> wrote:
> 
>> @Aljoscha Shading takes a few minutes for a full build; you can see this quite easily by looking at the compile step in the misc profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that take longer than a fraction of a second usually do so because they shade lots of classes. Note that I cannot tell you how much of this is spent on relocations, and how much on writing the jar.
>> 
>> Personally, I'd very much like us to move all shading to flink-shaded; this would finally allow us to use newer Maven versions without needing cumbersome workarounds for flink-dist. However, this isn't a trivial affair in some cases; IIRC calcite could be difficult to handle.
>> 
>> On another note, this would also simplify switching the main repo to another build system, since you would no longer have to deal with relocations, just packaging + merging NOTICE files.
>> 
>> @BowenLi I disagree; flink-shaded does not include any tests, API compatibility checks, checkstyle, layered shading (e.g., flink-runtime and flink-dist, where both relocate dependencies and one is bundled by the other), and, most importantly, CI (and really, without CI being covered in a PoC there's nothing to discuss).
>> 
>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>> Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.
>>> 
>>> Aljoscha
>>> 
>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
>>>> 
>>>> +1 to Till's points on #2 and #5, especially the potential non-disruptive, gradual migration approach if we decide to go that route.
>>>> 
>>>> To add on, I want to point out that we can actually start with the flink-shaded project [1], which is a perfect candidate for a PoC.
>>>> It's of a much smaller size, totally isolated from and not interfering with the flink project [2], and it actually covers most of our practical feature requirements for a build tool - all making it an ideal experimental field.
>>>> 
>>>> [1] https://github.com/apache/flink-shaded
>>>> [2] https://github.com/apache/flink
>>>> 
>>>> 
>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> 
>>>>> For the sake of keeping the discussion focused and not cluttering the discussion thread, I would suggest splitting the detailed reporting on reusing JVMs into a separate thread and cross-linking it from here.
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>> 
>>>>>> Update:
>>>>>> 
>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right away, while flink-tests has the potential for huge savings, but we have to figure out some issues first.
>>>>>> 
>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>>> 
>>>>>> 4/8 profiles failed.
>>>>>> 
>>>>>> No speedup in libraries, python, blink_planner, 7 minutes saved in libraries (table-planner).
>>>>>> 
>>>>>> The kafka and connectors profiles both fail in kafka tests due to producer leaks, and no speedup could be confirmed so far:
>>>>>> 
>>>>>> java.lang.AssertionError: Detected producer leak. Thread name: kafka-producer-network-thread | producer-239
>>>>>>     at org.junit.Assert.fail(Assert.java:88)
>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>>>> 
>>>>>> The tests profile failed due to various errors in migration tests:
>>>>>> 
>>>>>> junit.framework.AssertionFailedError: Did not see the expected accumulator results within time limit.
>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>>>> 
>>>>>> *However*, a normal tests run takes 40 minutes, while this one above failed after 19 minutes and is only missing the migration tests (which currently need 6-7 minutes). So we could save somewhere between 15 and 20 minutes here.
>>>>>> 
>>>>>> Finally, the misc profile fails in YARN:
>>>>>> 
>>>>>> java.lang.AssertionError
>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>>>> 
>>>>>> No significant speedup could be observed in other modules; for flink-yarn-tests we can maybe get a minute or two out of it.
>>>>>> 
>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>>>>> There appears to be general agreement that 1) should be looked into; I've set up a branch with fork reuse enabled for all tests; will report back the results.
>>>>>>> 
>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>>>>> Hello everyone,
>>>>>>>> 
>>>>>>>> improving our build times is a hot topic at the moment, so let's discuss the different ways they could be reduced.
>>>>>>>> 
>>>>>>>> Current state:
>>>>>>>> 
>>>>>>>> First up, let's look at some numbers:
>>>>>>>> 
>>>>>>>> 1 full build currently consumes 5h of build time total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may of course fluctuate depending on the current Travis load. This applies both to builds on the Apache and flink-ci Travis.
>>>>>>>> 
>>>>>>>> At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in), however we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze.
>>>>>>>> 
>>>>>>>> (Note: Recently, optimizations were added to ci-bot where pending builds are canceled if a new commit was pushed to the PR or the PR was closed, which should prove especially useful during the rush hours we see before feature freezes.)
>>>>>>>> 
>>>>>>>> Past approaches
>>>>>>>> 
>>>>>>>> Over the years we have done rather few things to improve this situation (hence our current predicament).
>>>>>>>> 
>>>>>>>> Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1.
>>>>>>>> 
>>>>>>>> The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible).
>>>>>>>> 
>>>>>>>> The rework that introduced stages did not provide a speedup at the time of its introduction either, although this changed slightly once more profiles were added and some optimizations to the caching were made.
>>>>>>>> 
>>>>>>>> Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences.
>>>>>>>> 
>>>>>>>> Suggestions
>>>>>>>> 
>>>>>>>> This is a list of /all/ suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine, nor may I agree with all of them).
>>>>>>>> 
>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>>>>    * We've seen significant speedups in the blink planner, and this should be applicable to all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated). A rough surefire sketch is at the end of this mail.
>>>>>>>> 2. Custom differential build scripts
>>>>>>>>    * Set up custom scripts for determining which modules might be affected by a change, and manipulate the splits accordingly.
>>>>>>>>      This approach is conceptually quite straightforward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules.
>>>>>>>> 3. Only run smoke tests when a PR is opened, run heavy tests on demand.
>>>>>>>>    * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc.) on demand.
>>>>>>>> 4. Move more tests into cron builds
>>>>>>>>    * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs.
>>>>>>>> 5. Gradle
>>>>>>>>    * Gradle was brought up a few times for its built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts.
>>>>>>>>    * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc.).
>>>>>>>>    * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (if they build from source).
>>>>>>>> 6. CI service
>>>>>>>>    * Our current artifact caching setup on Travis is basically a hack; we're abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues, and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be.
>>>>>>>>    * There are CI services that provide build artifact caching out of the box, which could be useful for us.
>>>>>>>>    * To date, no PoC for using another CI service has been provided.
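>>>>>>>> 
>>>>>>>> Regarding 1), the blink planner change boils down to a surefire-plugin setting. A minimal sketch of the idea (the values below are illustrative only; the actual configuration in the poms, e.g. the separate execution for IT cases, is more involved):
>>>>>>>> 
>>>>>>>> <plugin>
>>>>>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>>>>>   <artifactId>maven-surefire-plugin</artifactId>
>>>>>>>>   <configuration>
>>>>>>>>     <!-- spawn a single forked JVM and reuse it for all test classes,
>>>>>>>>          instead of paying the JVM startup cost once per class -->
>>>>>>>>     <forkCount>1</forkCount>
>>>>>>>>     <reuseForks>true</reuseForks>
>>>>>>>>   </configuration>
>>>>>>>> </plugin>
>>>>>>>> 
>>>>>>>> Whether this is safe depends on how well the tests in a module isolate their static state, which is presumably why reuse was disabled in the first place.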