How long do we need to run all e2e tests? They are not included in the 3.5 hours, I assume.
Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger <rmetz...@apache.org> wrote:

> Yes, we can ensure the same (or better) experience for contributors.
>
> On the powerful machines, builds finish in 1.5 hours (without any caching
> enabled).
>
> Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours per
> build for open source projects. Flink needs 3.5 hours on that infra (not
> parallelized at all, no caching). These free machines are very similar to
> those of Travis, so I expect no build time regressions if we set it up
> similarly.
>
> On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > Will using more powerful machines for the project make it more
> > difficult to ensure that contributor builds still run in a reasonable
> > time?
> >
> > As an example of this happening on Travis: contributors currently
> > cannot run all e2e tests since they time out, but on apache we have a
> > larger timeout.
> >
> > On 03/09/2019 18:57, Robert Metzger wrote:
> > > Hi all,
> > >
> > > I wanted to give a short update on this:
> > > - Arvid, Aljoscha and I have started working on a Gradle PoC,
> > > currently working on making all modules compile and test with Gradle.
> > > We've also identified some problematic areas (shading being the most
> > > obvious one) which we will analyse as part of the PoC.
> > > The goal is to see how much Gradle helps to parallelise our build,
> > > and to avoid duplicate work (incremental builds).
> > >
> > > - I am working on setting up a Flink testing infrastructure based on
> > > Azure Pipelines, using more powerful hardware. Alibaba kindly
> > > provided me with two 32-core machines (temporarily), and another
> > > company reached out to me privately, looking into options for cheap,
> > > fast machines :)
> > > If nobody in the community disagrees, I am going to set up Azure
> > > Pipelines with our apache/flink GitHub repository as a build
> > > infrastructure that exists next to Flinkbot and flink-ci. I would
> > > like to make sure that Azure Pipelines is as reliable as (or more
> > > reliable than) Travis, and I want to see what the required
> > > maintenance work is.
> > > On top of that, Azure Pipelines is a very feature-rich tool with a
> > > lot of nice options for us to improve the build experience
> > > (statistics about tests (flaky tests etc.), nice Docker support,
> > > plenty of free build resources for open source projects, ...)
> > >
> > > Best,
> > > Robert
> > >
> > > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <rmetz...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have summarized all arguments mentioned so far + some additional
> > >> research into a Wiki page here:
> > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> > >>
> > >> I'm happy to hear further comments on my summary! I'm pretty sure we
> > >> can find more pros and cons for the different options.
> > >>
> > >> My opinion after looking at the options:
> > >>
> > >> - Flink relies on an outdated build tool (Maven), while a good
> > >> alternative is well-established (Gradle) and will likely provide a
> > >> much better CI and local build experience through incremental builds
> > >> and cached intermediates.
> > >> Scripting around Maven, or splitting modules / test execution /
> > >> repositories, won't solve this problem.
> > >> We should rather spend the effort in migrating to a modern build
> > >> tool which will provide us benefits in the long run.
> > >> - Flink relies on a fairly slow build service (Travis CI), while
> > >> simply throwing more money at the problem could cut the build time
> > >> at least in half.
> > >> We should consider using a build service that provides bigger
> > >> machines to solve our build time problem.
> > >>
> > >> My opinion is based on many assumptions that we need to test first
> > >> through PoCs: Gradle is actually as fast as promised (I haven't used
> > >> it before), we can build Flink with Gradle, and we can find sponsors
> > >> for bigger build machines.
> > >>
> > >> Best,
> > >> Robert
> > >>
> > >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org>
> > >> wrote:
> > >>
> > >>> I did a quick test: a normal "mvn clean install -DskipTests
> > >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
> > >>> machine takes about 14 minutes. After removing all mentions of
> > >>> maven-shade-plugin the build time goes down to roughly 11.5
> > >>> minutes. (Obviously the resulting Flink won't work, because some
> > >>> expected stuff is not packaged, and most of the end-to-end tests
> > >>> use the shade plugin to package the jars for testing.)
> > >>>
> > >>> Aljoscha
> > >>>
> > >>>> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>> Hi all,
> > >>>>
> > >>>> I wanted to understand the impact of the hardware we are using for
> > >>>> running our tests. Each Travis worker has 2 virtual cores and
> > >>>> 7.5 GB of memory [1]; they are Google Cloud Compute Engine
> > >>>> *n1-standard-2* instances.
> > >>>> Running a full "mvn clean verify" takes *03:32 h* on such a machine
> > >>>> type. Running the same workload on a machine with 32 virtual cores
> > >>>> and 64 GB of memory takes *1:21 h*.
> > >>>>
> > >>>> What is interesting are the per-module build time differences.
> > >>>> Modules which parallelize tests well benefit greatly from the
> > >>>> additional cores:
> > >>>> "flink-tests" 36:51 min vs 4:33 min
> > >>>> "flink-runtime" 23:41 min vs 3:47 min
> > >>>> "flink-table-planner" 15:54 min vs 3:13 min
> > >>>>
> > >>>> On the other hand, we have modules which are not parallel at all:
> > >>>> "flink-connector-kafka": 16:32 min vs 15:19 min
> > >>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> > >>>> Also, the checkstyle plugin is not scaling at all.
> > >>>>
> > >>>> Chesnay reported some significant speedups by reusing forks.
> > >>>> I don't know how much effort it would be to make the Kafka tests
> > >>>> parallelizable. In total, they currently use 30 minutes on the big
> > >>>> machine (while 31 CPUs are idling :) )
> > >>>>
> > >>>> Let me know what you think about these results. If the community
> > >>>> is generally interested in investigating further in that
> > >>>> direction, I could look into software to orchestrate this, as well
> > >>>> as sponsors for such an infrastructure.
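[Regarding the per-module parallelization numbers quoted above: whether a
module's tests can use the extra cores is, to my understanding, mostly a
question of its maven-surefire-plugin fork settings. A minimal sketch of
such a configuration; the values here are illustrative assumptions, not
Flink's actual pom.xml:

    <!-- Sketch only: how surefire fork settings control test parallelism. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- spawn one forked JVM per CPU core; test classes are
             distributed across the forks and run concurrently -->
        <forkCount>1C</forkCount>
        <!-- reuse each fork across test classes instead of paying
             JVM startup cost for every class -->
        <reuseForks>true</reuseForks>
      </configuration>
    </plugin>

Tests that depend on global state (fixed ports, shared directories,
singletons) cannot simply be forked like this, which is presumably what
keeps the Kafka modules serial.]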
> > >>>>
> > >>>> [1] https://docs.travis-ci.com/user/reference/overview/
> > >>>>
> > >>>>
> > >>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> @Aljoscha Shading takes a few minutes for a full build; you can
> > >>>>> see this quite easily by looking at the compile step in the misc
> > >>>>> profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all
> > >>>>> modules that take longer than a fraction of a second are usually
> > >>>>> caused by shading lots of classes. Note that I cannot tell you
> > >>>>> how much of this is spent on relocations, and how much on writing
> > >>>>> the jar.
> > >>>>>
> > >>>>> Personally, I'd very much like us to move all shading to
> > >>>>> flink-shaded; this would finally allow us to use newer Maven
> > >>>>> versions without needing cumbersome workarounds for flink-dist.
> > >>>>> However, this isn't a trivial affair in some cases; IIRC Calcite
> > >>>>> could be difficult to handle.
> > >>>>>
> > >>>>> On another note, this would also simplify switching the main repo
> > >>>>> to another build system, since you would no longer have to deal
> > >>>>> with relocations, just packaging + merging NOTICE files.
> > >>>>>
> > >>>>> @BowenLi I disagree; flink-shaded does not include any tests, API
> > >>>>> compatibility checks, checkstyle, layered shading (e.g.,
> > >>>>> flink-runtime and flink-dist, where both relocate dependencies
> > >>>>> and one is bundled by the other), and, most importantly, CI (and
> > >>>>> really, without CI being covered in a PoC there's nothing to
> > >>>>> discuss).
> > >>>>>
> > >>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> > >>>>>> Speaking of flink-shaded, do we have any idea what the impact of
> > >>>>>> shading is on the build time? We could get rid of shading
> > >>>>>> completely in the Flink main repository by moving everything
> > >>>>>> that we shade to flink-shaded.
> > >>>>>>
> > >>>>>> Aljoscha
> > >>>>>>
> > >>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> +1 to Till's points on #2 and #5, especially the potential
> > >>>>>>> non-disruptive, gradual migration approach if we decide to go
> > >>>>>>> that route.
> > >>>>>>>
> > >>>>>>> To add on, I want to point out that we can actually start with
> > >>>>>>> the flink-shaded project [1], which is a perfect candidate for
> > >>>>>>> a PoC. It's of much smaller size, totally isolated from and not
> > >>>>>>> interfering with the flink project [2], and it actually covers
> > >>>>>>> most of our practical feature requirements for a build tool,
> > >>>>>>> all making it an ideal testing ground.
> > >>>>>>>
> > >>>>>>> [1] https://github.com/apache/flink-shaded
> > >>>>>>> [2] https://github.com/apache/flink
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <trohrm...@apache.org>
> > >>>>>>> wrote:
> > >>>>>>>> For the sake of keeping the discussion focused and not
> > >>>>>>>> cluttering the thread, I would suggest splitting the detailed
> > >>>>>>>> reporting on JVM reuse into a separate thread and
> > >>>>>>>> cross-linking it from here.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Till
> > >>>>>>>>
> > >>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Update:
> > >>>>>>>>>
> > >>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork
> > >>>>>>>>> reuse right away, while flink-tests has the potential for
> > >>>>>>>>> huge savings, but we have to figure out some issues first.
> > >>>>>>>>>
> > >>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> > >>>>>>>>>
> > >>>>>>>>> 4/8 profiles failed.
> > >>>>>>>>>
> > >>>>>>>>> No speedup in libraries, python, blink_planner; 7 minutes
> > >>>>>>>>> saved in libraries (table-planner).
> > >>>>>>>>>
> > >>>>>>>>> The kafka and connectors profiles both fail in kafka tests
> > >>>>>>>>> due to producer leaks, and no speedup could be confirmed so
> > >>>>>>>>> far:
> > >>>>>>>>>
> > >>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
> > >>>>>>>>> kafka-producer-network-thread | producer-239
> > >>>>>>>>>     at org.junit.Assert.fail(Assert.java:88)
> > >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> > >>>>>>>>>
> > >>>>>>>>> The tests profile failed due to various errors in migration
> > >>>>>>>>> tests:
> > >>>>>>>>>
> > >>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected
> > >>>>>>>>> accumulator results within time limit.
> > >>>>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> > >>>>>>>>>
> > >>>>>>>>> *However*, a normal tests run takes 40 minutes, while the one
> > >>>>>>>>> above failed after 19 minutes and is only missing the
> > >>>>>>>>> migration tests (which currently need 6-7 minutes). So we
> > >>>>>>>>> could save somewhere between 15 and 20 minutes here.
> > >>>>>>>>>
> > >>>>>>>>> Finally, the misc profile fails in YARN:
> > >>>>>>>>>
> > >>>>>>>>> java.lang.AssertionError
> > >>>>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> > >>>>>>>>>
> > >>>>>>>>> No significant speedup could be observed in other modules;
> > >>>>>>>>> for flink-yarn-tests we can maybe get a minute or 2 out of it.
> > >>>>>>>>>
> > >>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> > >>>>>>>>>> There appears to be a general agreement that 1) should be
> > >>>>>>>>>> looked into. I've set up a branch with fork reuse enabled
> > >>>>>>>>>> for all tests and will report back the results.
> > >>>>>>>>>>
> > >>>>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
> > >>>>>>>>>>> Hello everyone,
> > >>>>>>>>>>>
> > >>>>>>>>>>> improving our build times is a hot topic at the moment, so
> > >>>>>>>>>>> let's discuss the different ways they could be reduced.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Current state:
> > >>>>>>>>>>>
> > >>>>>>>>>>> First up, let's look at some numbers:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1 full build currently consumes 5h of build time in total
> > >>>>>>>>>>> ("total time"), and in the ideal case takes about 1h20m
> > >>>>>>>>>>> ("run time") to complete from start to finish. The run time
> > >>>>>>>>>>> may of course fluctuate depending on the current Travis
> > >>>>>>>>>>> load. This applies to builds on both the Apache and
> > >>>>>>>>>>> flink-ci Travis accounts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> At the time of writing, the current queue time for PR jobs
> > >>>>>>>>>>> (reminder: running on flink-ci) is about 30 minutes (which
> > >>>>>>>>>>> basically means that we are processing builds at the rate
> > >>>>>>>>>>> that they come in), though we are in an admittedly quiet
> > >>>>>>>>>>> period right now.
> > >>>>>>>>>>> 2 weeks ago the queue times on flink-ci peaked at around
> > >>>>>>>>>>> 5-6h as everyone was scrambling to get their changes merged
> > >>>>>>>>>>> in time for the feature freeze.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (Note: Recently, optimizations were added to ci-bot where
> > >>>>>>>>>>> pending builds are canceled if a new commit was pushed to
> > >>>>>>>>>>> the PR or the PR was closed, which should prove especially
> > >>>>>>>>>>> useful during the rush hours we see before feature
> > >>>>>>>>>>> freezes.)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Past approaches
> > >>>>>>>>>>>
> > >>>>>>>>>>> Over the years we have done rather few things to improve
> > >>>>>>>>>>> this situation (hence our current predicament).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Beyond the sporadic speedup of some tests, the only notable
> > >>>>>>>>>>> reduction in total build times was the introduction of cron
> > >>>>>>>>>>> jobs, which consolidated the per-commit matrix from 4
> > >>>>>>>>>>> configurations (different scala/hadoop versions) to 1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The separation into multiple build profiles was only a
> > >>>>>>>>>>> work-around for the 50m limit on Travis. Running tests in
> > >>>>>>>>>>> parallel has the obvious potential of reducing run time,
> > >>>>>>>>>>> but we're currently hitting a hard limit since a few
> > >>>>>>>>>>> modules (flink-tests, flink-runtime,
> > >>>>>>>>>>> flink-table-planner-blink) are so loaded with tests that
> > >>>>>>>>>>> they nearly consume an entire profile by themselves (and
> > >>>>>>>>>>> thus no further splitting is possible).
> > >>>>>>>>>>>
> > >>>>>>>>>>> The rework that introduced stages did not provide a speedup
> > >>>>>>>>>>> at the time of introduction either, although this changed
> > >>>>>>>>>>> slightly once more profiles were added and some
> > >>>>>>>>>>> optimizations to the caching were made.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Very recently we modified the surefire-plugin configuration
> > >>>>>>>>>>> for flink-table-planner-blink to reuse JVM forks for IT
> > >>>>>>>>>>> cases, providing a significant speedup (18 minutes!). So
> > >>>>>>>>>>> far we have not seen any negative consequences.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Suggestions
> > >>>>>>>>>>>
> > >>>>>>>>>>> This is a list of /all/ suggestions for reducing run/total
> > >>>>>>>>>>> times that I have seen recently (in other words, they
> > >>>>>>>>>>> aren't necessarily mine, nor may I agree with all of them).
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
> > >>>>>>>>>>>     * We've seen significant speedups in the blink planner,
> > >>>>>>>>>>>       and this should be applicable to all modules.
> > >>>>>>>>>>>       However, I presume there's a reason why we disabled
> > >>>>>>>>>>>       JVM reuse (information on this would be appreciated).
> > >>>>>>>>>>>       (A configuration sketch follows after this list.)
> > >>>>>>>>>>> 2. Custom differential build scripts
> > >>>>>>>>>>>     * Set up custom scripts for determining which modules
> > >>>>>>>>>>>       might be affected by a change, and manipulate the
> > >>>>>>>>>>>       splits accordingly. This approach is conceptually
> > >>>>>>>>>>>       quite straight-forward, but has limits since it has
> > >>>>>>>>>>>       to be pessimistic; i.e. a change in flink-core _must_
> > >>>>>>>>>>>       result in testing all modules.
> > >>>>>>>>>>> 3. Only run smoke tests when a PR is opened; run heavy
> > >>>>>>>>>>>    tests on demand.
> > >>>>>>>>>>>     * With the introduction of the ci-bot we now have
> > >>>>>>>>>>>       significantly more options on how to handle PR
> > >>>>>>>>>>>       builds. One option could be to only run basic tests
> > >>>>>>>>>>>       when the PR is created (which may be only modified
> > >>>>>>>>>>>       modules, or all unit tests, or another low-cost
> > >>>>>>>>>>>       scheme), and then have a committer trigger other
> > >>>>>>>>>>>       builds (full test run, e2e tests, etc.) on demand.
> > >>>>>>>>>>> 4. Move more tests into cron builds
> > >>>>>>>>>>>     * The budget version of 3); move certain tests that are
> > >>>>>>>>>>>       either expensive (like some runtime tests that take
> > >>>>>>>>>>>       minutes) or in rarely modified modules (like gelly)
> > >>>>>>>>>>>       into cron jobs.
> > >>>>>>>>>>> 5. Gradle
> > >>>>>>>>>>>     * Gradle was brought up a few times for its built-in
> > >>>>>>>>>>>       support for differential builds; basically providing
> > >>>>>>>>>>>       2) without the overhead of maintaining additional
> > >>>>>>>>>>>       scripts.
> > >>>>>>>>>>>     * To date no PoC was provided that shows it working in
> > >>>>>>>>>>>       our CI environment (i.e., handling splits & caching
> > >>>>>>>>>>>       etc.).
> > >>>>>>>>>>>     * This is the most disruptive change by a fair margin,
> > >>>>>>>>>>>       as it would affect the entire project, developers,
> > >>>>>>>>>>>       and potentially users (if they build from source).
> > >>>>>>>>>>> 6. CI service
> > >>>>>>>>>>>     * Our current artifact caching setup on Travis is
> > >>>>>>>>>>>       basically a hack; we're abusing the Travis cache,
> > >>>>>>>>>>>       which is meant for long-term caching, to ship build
> > >>>>>>>>>>>       artifacts across jobs. It's brittle at times due to
> > >>>>>>>>>>>       timing/visibility issues, and on branches the cleanup
> > >>>>>>>>>>>       processes can interfere with running builds. It is
> > >>>>>>>>>>>       also not as effective as it could be.
> > >>>>>>>>>>>     * There are CI services that provide build artifact
> > >>>>>>>>>>>       caching out of the box, which could be useful for us.
> > >>>>>>>>>>>     * To date, no PoC for using another CI service has been
> > >>>>>>>>>>>       provided.
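[A follow-up on suggestion 1), as referenced in the list above: fork reuse
for IT cases is a surefire switch, so enabling it per module is a small pom
change. A minimal sketch, assuming a dedicated execution for IT cases (the
execution id and includes pattern are illustrative, not necessarily how the
Flink build wires this):

    <!-- Sketch only: enabling JVM fork reuse for IT cases in one module. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <executions>
        <execution>
          <id>integration-tests</id>
          <goals>
            <goal>test</goal>
          </goals>
          <configuration>
            <includes>
              <include>**/*ITCase.*</include>
            </includes>
            <!-- keep one JVM alive across all IT case classes -->
            <forkCount>1</forkCount>
            <reuseForks>true</reuseForks>
          </configuration>
        </execution>
      </executions>
    </plugin>

The obvious risk is the one visible in the build linked above: state
leaking across test classes in the shared JVM (static fields, lingering
threads such as the leaked Kafka producer), which may well be the reason
fork reuse was disabled originally.]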