How long do we need to run all e2e tests? They are not included in the 3.5 hours, I assume.
Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger <rmetz...@apache.org> wrote:

> Yes, we can ensure the same (or better) experience for contributors.
>
> On the powerful machines, builds finish in 1.5 hours (without any caching
> enabled).
>
> Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours per
> build for open source projects. Flink needs 3.5 hours on that infra (not
> parallelized at all, no caching). These free machines are very similar to
> those of Travis, so I expect no build time regressions if we set it up
> similarly.
>
> On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > Will using more powerful machines for the project make it more
> > difficult to ensure that contributor builds still run in a reasonable
> > time?
> >
> > As an example of this happening on Travis: contributors currently
> > cannot run all e2e tests since they time out, but on apache we have a
> > larger timeout.
> >
> > On 03/09/2019 18:57, Robert Metzger wrote:
> > > Hi all,
> > >
> > > I wanted to give a short update on this:
> > > - Arvid, Aljoscha and I have started working on a Gradle PoC,
> > > currently working on making all modules compile and test with Gradle.
> > > We've also identified some problematic areas (shading being the most
> > > obvious one) which we will analyse as part of the PoC.
> > > The goal is to see how much Gradle helps to parallelise our build,
> > > and to avoid duplicate work (incremental builds).
> > >
> > > - I am working on setting up a Flink testing infrastructure based on
> > > Azure Pipelines, using more powerful hardware. Alibaba kindly
> > > provided me with two 32-core machines (temporarily), and another
> > > company reached out to me privately, looking into options for cheap,
> > > fast machines :)
> > > If nobody in the community disagrees, I am going to set up Azure
> > > Pipelines with our apache/flink GitHub repository as a build
> > > infrastructure that exists next to Flinkbot and flink-ci. I would
> > > like to make sure that Azure Pipelines is as reliable as (or more
> > > reliable than) Travis, and I want to see what the required
> > > maintenance work is.
> > > On top of that, Azure Pipelines is a very feature-rich tool with a
> > > lot of nice options for us to improve the build experience
> > > (statistics about tests (flaky tests etc.), nice Docker support,
> > > plenty of free build resources for open source projects, ...)
> > >
> > > Best,
> > > Robert
> > >
> > > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <rmetz...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have summarized all arguments mentioned so far + some additional
> > >> research into a Wiki page here:
> > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> > >>
> > >> I'm happy to hear further comments on my summary! I'm pretty sure we
> > >> can find more pros and cons for the different options.
> > >>
> > >> My opinion after looking at the options:
> > >>
> > >> - Flink relies on an outdated build tool (Maven), while a good
> > >> alternative is well-established (Gradle) and will likely provide a
> > >> much better CI and local build experience through incremental builds
> > >> and cached intermediates.
> > >> Scripting around Maven, or splitting modules / test execution /
> > >> repositories, won't solve this problem.
> > >> We should rather spend the effort in migrating to a modern build
> > >> tool which will provide us benefits in the long run.
> > >> - Flink relies on a fairly slow build service (Travis CI), while
> > >> simply throwing more money at the problem could cut the build time
> > >> at least in half.
> > >> We should consider using a build service that provides bigger
> > >> machines to solve our build time problem.
> > >>
> > >> My opinion is based on many assumptions that we need to test first
> > >> through PoCs: Gradle is actually as fast as promised (I haven't used
> > >> it before), we can build Flink with Gradle, and we can find sponsors
> > >> for bigger build machines.
> > >>
> > >> Best,
> > >> Robert
> > >>
> > >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org>
> > >> wrote:
> > >>
> > >>> I did a quick test: a normal "mvn clean install -DskipTests
> > >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
> > >>> machine takes about 14 minutes. After removing all mentions of
> > >>> maven-shade-plugin the build time goes down to roughly 11.5
> > >>> minutes. (Obviously the resulting Flink won't work, because some
> > >>> expected stuff is not packaged, and most of the end-to-end tests
> > >>> use the shade plugin to package the jars for testing.)
> > >>>
> > >>> Aljoscha
> > >>>
> > >>>> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>> Hi all,
> > >>>>
> > >>>> I wanted to understand the impact of the hardware we are using for
> > >>>> running our tests. Each Travis worker has 2 virtual cores and
> > >>>> 7.5 GB of memory [1]; they are Google Cloud Compute Engine
> > >>>> *n1-standard-2* instances.
> > >>>> Running a full "mvn clean verify" takes *03:32 h* on such a machine
> > >>>> type. Running the same workload on a machine with 32 virtual cores
> > >>>> and 64 GB of memory takes *1:21 h*.
> > >>>>
> > >>>> What is interesting are the per-module build time differences.
> > >>>> Modules which parallelize tests well benefit greatly from the
> > >>>> additional cores:
> > >>>> "flink-tests" 36:51 min vs 4:33 min
> > >>>> "flink-runtime" 23:41 min vs 3:47 min
> > >>>> "flink-table-planner" 15:54 min vs 3:13 min
> > >>>>
> > >>>> On the other hand, we have modules which are not parallel at all:
> > >>>> "flink-connector-kafka": 16:32 min vs 15:19 min
> > >>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> > >>>> Also, the checkstyle plugin is not scaling at all.
> > >>>>
> > >>>> Chesnay reported some significant speedups by reusing forks.
> > >>>> I don't know how much effort it would be to make the Kafka tests
> > >>>> parallelizable. In total, they currently use 30 minutes on the big
> > >>>> machine (while 31 CPUs are idling :) )
> > >>>>
> > >>>> Let me know what you think about these results. If the community
> > >>>> is generally interested in investigating further in that
> > >>>> direction, I could look into software to orchestrate this, as well
> > >>>> as sponsors for such an infrastructure.
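[Regarding the per-module parallelization numbers quoted above: whether a
module's tests can use the extra cores is, to my understanding, mostly a
question of its maven-surefire-plugin fork settings. A minimal sketch of
such a configuration; the values here are illustrative assumptions, not
Flink's actual pom.xml:

    <!-- Sketch only: how surefire fork settings control test parallelism. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- spawn one forked JVM per CPU core; test classes are
             distributed across the forks and run concurrently -->
        <forkCount>1C</forkCount>
        <!-- reuse each fork across test classes instead of paying
             JVM startup cost for every class -->
        <reuseForks>true</reuseForks>
      </configuration>
    </plugin>

Tests that depend on global state (fixed ports, shared directories,
singletons) cannot simply be forked like this, which is presumably what
keeps the Kafka modules serial.]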
> > >>>>
> > >>>> [1] https://docs.travis-ci.com/user/reference/overview/
> > >>>>
> > >>>>
> > >>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> @Aljoscha Shading takes a few minutes for a full build; you can
> > >>>>> see this quite easily by looking at the compile step in the misc
> > >>>>> profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all
> > >>>>> modules that take longer than a fraction of a second are usually
> > >>>>> caused by shading lots of classes. Note that I cannot tell you
> > >>>>> how much of this is spent on relocations, and how much on writing
> > >>>>> the jar.
> > >>>>>
> > >>>>> Personally, I'd very much like us to move all shading to
> > >>>>> flink-shaded; this would finally allow us to use newer Maven
> > >>>>> versions without needing cumbersome workarounds for flink-dist.
> > >>>>> However, this isn't a trivial affair in some cases; IIRC Calcite
> > >>>>> could be difficult to handle.
> > >>>>>
> > >>>>> On another note, this would also simplify switching the main repo
> > >>>>> to another build system, since you would no longer have to deal
> > >>>>> with relocations, just packaging + merging NOTICE files.
> > >>>>>
> > >>>>> @BowenLi I disagree; flink-shaded does not include any tests, API
> > >>>>> compatibility checks, checkstyle, layered shading (e.g.,
> > >>>>> flink-runtime and flink-dist, where both relocate dependencies
> > >>>>> and one is bundled by the other), and, most importantly, CI (and
> > >>>>> really, without CI being covered in a PoC there's nothing to
> > >>>>> discuss).
> > >>>>>
> > >>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> > >>>>>> Speaking of flink-shaded, do we have any idea what the impact of
> > >>>>>> shading is on the build time? We could get rid of shading
> > >>>>>> completely in the Flink main repository by moving everything
> > >>>>>> that we shade to flink-shaded.
> > >>>>>>
> > >>>>>> Aljoscha
> > >>>>>>
> > >>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> +1 to Till's points on #2 and #5, especially the potential
> > >>>>>>> non-disruptive, gradual migration approach if we decide to go
> > >>>>>>> that route.
> > >>>>>>>
> > >>>>>>> To add on, I want to point out that we can actually start with
> > >>>>>>> the flink-shaded project [1], which is a perfect candidate for
> > >>>>>>> a PoC. It's of much smaller size, totally isolated from and not
> > >>>>>>> interfering with the flink project [2], and it actually covers
> > >>>>>>> most of our practical feature requirements for a build tool,
> > >>>>>>> all making it an ideal testing ground.
> > >>>>>>>
> > >>>>>>> [1] https://github.com/apache/flink-shaded
> > >>>>>>> [2] https://github.com/apache/flink
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <trohrm...@apache.org>
> > >>>>>>> wrote:
> > >>>>>>>> For the sake of keeping the discussion focused and not
> > >>>>>>>> cluttering the thread, I would suggest splitting the detailed
> > >>>>>>>> reporting on JVM reuse into a separate thread and
> > >>>>>>>> cross-linking it from here.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Till
> > >>>>>>>>
> > >>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Update:
> > >>>>>>>>>
> > >>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork
> > >>>>>>>>> reuse right away, while flink-tests has the potential for
> > >>>>>>>>> huge savings, but we have to figure out some issues first.
> > >>>>>>>>>
> > >>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> > >>>>>>>>>
> > >>>>>>>>> 4/8 profiles failed.
> > >>>>>>>>>
> > >>>>>>>>> No speedup in libraries, python, blink_planner; 7 minutes
> > >>>>>>>>> saved in libraries (table-planner).
> > >>>>>>>>>
> > >>>>>>>>> The kafka and connectors profiles both fail in kafka tests
> > >>>>>>>>> due to producer leaks, and no speedup could be confirmed so
> > >>>>>>>>> far:
> > >>>>>>>>>
> > >>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
> > >>>>>>>>> kafka-producer-network-thread | producer-239
> > >>>>>>>>>     at org.junit.Assert.fail(Assert.java:88)
> > >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > >>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> > >>>>>>>>>
> > >>>>>>>>> The tests profile failed due to various errors in migration
> > >>>>>>>>> tests:
> > >>>>>>>>>
> > >>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected
> > >>>>>>>>> accumulator results within time limit.
> > >>>>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> > >>>>>>>>>
> > >>>>>>>>> *However*, a normal tests run takes 40 minutes, while the one
> > >>>>>>>>> above failed after 19 minutes and is only missing the
> > >>>>>>>>> migration tests (which currently need 6-7 minutes). So we
> > >>>>>>>>> could save somewhere between 15 and 20 minutes here.
> > >>>>>>>>>
> > >>>>>>>>> Finally, the misc profile fails in YARN:
> > >>>>>>>>>
> > >>>>>>>>> java.lang.AssertionError
> > >>>>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> > >>>>>>>>>
> > >>>>>>>>> No significant speedup could be observed in other modules;
> > >>>>>>>>> for flink-yarn-tests we can maybe get a minute or 2 out of it.
> > >>>>>>>>>
> > >>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> > >>>>>>>>>> There appears to be a general agreement that 1) should be
> > >>>>>>>>>> looked into. I've set up a branch with fork reuse enabled
> > >>>>>>>>>> for all tests and will report back the results.
> > >>>>>>>>>>
> > >>>>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
> > >>>>>>>>>>> Hello everyone,
> > >>>>>>>>>>>
> > >>>>>>>>>>> improving our build times is a hot topic at the moment, so
> > >>>>>>>>>>> let's discuss the different ways they could be reduced.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Current state:
> > >>>>>>>>>>>
> > >>>>>>>>>>> First up, let's look at some numbers:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1 full build currently consumes 5h of build time in total
> > >>>>>>>>>>> ("total time"), and in the ideal case takes about 1h20m
> > >>>>>>>>>>> ("run time") to complete from start to finish. The run time
> > >>>>>>>>>>> may of course fluctuate depending on the current Travis
> > >>>>>>>>>>> load. This applies to builds on both the Apache and
> > >>>>>>>>>>> flink-ci Travis accounts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> At the time of writing, the current queue time for PR jobs
> > >>>>>>>>>>> (reminder: running on flink-ci) is about 30 minutes (which
> > >>>>>>>>>>> basically means that we are processing builds at the rate
> > >>>>>>>>>>> that they come in), though we are in an admittedly quiet
> > >>>>>>>>>>> period right now.
> > >>>>>>>>>>> 2 weeks ago the queue times on flink-ci peaked at around
> > >>>>>>>>>>> 5-6h as everyone was scrambling to get their changes merged
> > >>>>>>>>>>> in time for the feature freeze.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (Note: Recently, optimizations were added to ci-bot where
> > >>>>>>>>>>> pending builds are canceled if a new commit was pushed to
> > >>>>>>>>>>> the PR or the PR was closed, which should prove especially
> > >>>>>>>>>>> useful during the rush hours we see before feature
> > >>>>>>>>>>> freezes.)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Past approaches
> > >>>>>>>>>>>
> > >>>>>>>>>>> Over the years we have done rather few things to improve
> > >>>>>>>>>>> this situation (hence our current predicament).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Beyond the sporadic speedup of some tests, the only notable
> > >>>>>>>>>>> reduction in total build times was the introduction of cron
> > >>>>>>>>>>> jobs, which consolidated the per-commit matrix from 4
> > >>>>>>>>>>> configurations (different scala/hadoop versions) to 1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The separation into multiple build profiles was only a
> > >>>>>>>>>>> work-around for the 50m limit on Travis. Running tests in
> > >>>>>>>>>>> parallel has the obvious potential of reducing run time,
> > >>>>>>>>>>> but we're currently hitting a hard limit since a few
> > >>>>>>>>>>> modules (flink-tests, flink-runtime,
> > >>>>>>>>>>> flink-table-planner-blink) are so loaded with tests that
> > >>>>>>>>>>> they nearly consume an entire profile by themselves (and
> > >>>>>>>>>>> thus no further splitting is possible).
> > >>>>>>>>>>>
> > >>>>>>>>>>> The rework that introduced stages did not provide a speedup
> > >>>>>>>>>>> at the time of introduction either, although this changed
> > >>>>>>>>>>> slightly once more profiles were added and some
> > >>>>>>>>>>> optimizations to the caching were made.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Very recently we modified the surefire-plugin configuration
> > >>>>>>>>>>> for flink-table-planner-blink to reuse JVM forks for IT
> > >>>>>>>>>>> cases, providing a significant speedup (18 minutes!). So
> > >>>>>>>>>>> far we have not seen any negative consequences.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Suggestions
> > >>>>>>>>>>>
> > >>>>>>>>>>> This is a list of /all/ suggestions for reducing run/total
> > >>>>>>>>>>> times that I have seen recently (in other words, they
> > >>>>>>>>>>> aren't necessarily mine, nor may I agree with all of them).
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
> > >>>>>>>>>>>     * We've seen significant speedups in the blink planner,
> > >>>>>>>>>>>       and this should be applicable to all modules.
> > >>>>>>>>>>>       However, I presume there's a reason why we disabled
> > >>>>>>>>>>>       JVM reuse (information on this would be appreciated).
> > >>>>>>>>>>>       (A configuration sketch follows after this list.)
> > >>>>>>>>>>> 2. Custom differential build scripts
> > >>>>>>>>>>>     * Set up custom scripts for determining which modules
> > >>>>>>>>>>>       might be affected by a change, and manipulate the
> > >>>>>>>>>>>       splits accordingly. This approach is conceptually
> > >>>>>>>>>>>       quite straight-forward, but has limits since it has
> > >>>>>>>>>>>       to be pessimistic; i.e. a change in flink-core _must_
> > >>>>>>>>>>>       result in testing all modules.
> > >>>>>>>>>>> 3. Only run smoke tests when a PR is opened; run heavy
> > >>>>>>>>>>>    tests on demand.
> > >>>>>>>>>>>     * With the introduction of the ci-bot we now have
> > >>>>>>>>>>>       significantly more options on how to handle PR
> > >>>>>>>>>>>       builds. One option could be to only run basic tests
> > >>>>>>>>>>>       when the PR is created (which may be only modified
> > >>>>>>>>>>>       modules, or all unit tests, or another low-cost
> > >>>>>>>>>>>       scheme), and then have a committer trigger other
> > >>>>>>>>>>>       builds (full test run, e2e tests, etc.) on demand.
> > >>>>>>>>>>> 4. Move more tests into cron builds
> > >>>>>>>>>>>     * The budget version of 3); move certain tests that are
> > >>>>>>>>>>>       either expensive (like some runtime tests that take
> > >>>>>>>>>>>       minutes) or in rarely modified modules (like gelly)
> > >>>>>>>>>>>       into cron jobs.
> > >>>>>>>>>>> 5. Gradle
> > >>>>>>>>>>>     * Gradle was brought up a few times for its built-in
> > >>>>>>>>>>>       support for differential builds; basically providing
> > >>>>>>>>>>>       2) without the overhead of maintaining additional
> > >>>>>>>>>>>       scripts.
> > >>>>>>>>>>>     * To date no PoC was provided that shows it working in
> > >>>>>>>>>>>       our CI environment (i.e., handling splits & caching
> > >>>>>>>>>>>       etc.).
> > >>>>>>>>>>>     * This is the most disruptive change by a fair margin,
> > >>>>>>>>>>>       as it would affect the entire project, developers,
> > >>>>>>>>>>>       and potentially users (if they build from source).
> > >>>>>>>>>>> 6. CI service
> > >>>>>>>>>>>     * Our current artifact caching setup on Travis is
> > >>>>>>>>>>>       basically a hack; we're abusing the Travis cache,
> > >>>>>>>>>>>       which is meant for long-term caching, to ship build
> > >>>>>>>>>>>       artifacts across jobs. It's brittle at times due to
> > >>>>>>>>>>>       timing/visibility issues, and on branches the cleanup
> > >>>>>>>>>>>       processes can interfere with running builds. It is
> > >>>>>>>>>>>       also not as effective as it could be.
> > >>>>>>>>>>>     * There are CI services that provide build artifact
> > >>>>>>>>>>>       caching out of the box, which could be useful for us.
> > >>>>>>>>>>>     * To date, no PoC for using another CI service has been
> > >>>>>>>>>>>       provided.
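[A follow-up on suggestion 1), as referenced in the list above: fork reuse
for IT cases is a surefire switch, so enabling it per module is a small pom
change. A minimal sketch, assuming a dedicated execution for IT cases (the
execution id and includes pattern are illustrative, not necessarily how the
Flink build wires this):

    <!-- Sketch only: enabling JVM fork reuse for IT cases in one module. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <executions>
        <execution>
          <id>integration-tests</id>
          <goals>
            <goal>test</goal>
          </goals>
          <configuration>
            <includes>
              <include>**/*ITCase.*</include>
            </includes>
            <!-- keep one JVM alive across all IT case classes -->
            <forkCount>1</forkCount>
            <reuseForks>true</reuseForks>
          </configuration>
        </execution>
      </executions>
    </plugin>

The obvious risk is the one visible in the build linked above: state
leaking across test classes in the shared JVM (static fields, lingering
threads such as the leaked Kafka producer), which may well be the reason
fork reuse was disabled originally.]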