Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Bowen Wed, 26 Jun 2019 12:50:57 -0700

Just to elaborate a bit more on why a slow build is OK but having no resources 
is not: say I submit a build request at 9am PST, no other requests exist, and 
mine is at the head of the queue. Currently, it still cannot start building 
until 4 or 5pm.




> On Jun 26, 2019, at 12:28, Bowen Li <[email protected]> wrote:
> 
> Hi,
> 
> @Dawid, I think the "long-running tests" issue that I mentioned in the first 
> email, and that you all raised as well, belongs to "a big effort which is much 
> harder to accomplish in a short period of time and may deserve its own 
> separate discussion". That is why I didn't include it in what we can do in 
> the foreseeable short term.
> 
> Besides, I don't think that's the ultimate reason for the lack of build 
> resources. Even if the build is shortened to something like 2h, the problem 
> I described, where no build machine is working for 6 or more hours during 
> the PST daytime, will still happen, because no machine from ASF INFRA's pool 
> is allocated to Flink. I have paid close attention to the build queue over 
> the past few weekdays, and the pattern is now pretty clear.
> 
> **The ultimate root cause** is that we don't have any **dedicated** build 
> resources that we can reliably depend on. I'm actually OK with waiting a 
> long time if build requests are running, because that means we are at least 
> making progress. But I'm not OK with having no build resources at all. I 
> think a better short-term goal is to always have at least a central pool 
> (it can be 3 or 5) of machines dedicated to building Flink at any time, or 
> maybe to use users' resources.
> 
> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is using 
> a Jenkins job to automatically build on users' Travis accounts and link the 
> result back to the GitHub PR. I guess the Jenkins job fetches the latest 
> upstream master and builds the PR against it. Jeff has filed tickets to learn 
> about and get access to the Jenkins infra. It would be better to fully 
> understand this approach before judging it.
> 
> I have also heard good things about CircleCI, and ASF INFRA seems to have a 
> pool of build capacity there too. It could be an alternative to consider.
> 
>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[email protected]> 
>> wrote:
>> Sorry to jump in late, but I think Bowen missed the most important point
>> from Chesnay's previous message in the summary. The ultimate reason for
>> all the problems is that the tests take close to 2 hours to run already.
>> I fully support this claim: "Unless people start caring about test times
>> before adding them, this issue cannot be solved"
>> 
>> This is also another reason why using user's Travis account won't help.
>> Every few weeks we reach the user's time limit for a single profile.
>> This makes the user's builds simply fail, until we either properly
>> decrease the time the tests take (which I am not sure we ever did) or
>> postpone the problem by splitting into more profiles. (Note that the ASF
>> Travis account has higher time limits)
>> 
>> Best,
>> 
>> Dawid
>> 
>> On 26/06/2019 09:36, Robert Metzger wrote:
>> > Do we know if using "the best" available hardware would improve the build
>> > times?
>> > Imagine we would run the build on machines with plenty of main memory to
>> > mount everything to ramdisk + the latest CPU architecture?
>> >
>> > Throwing hardware at the problem could help reduce the time of an
>> > individual build, and using our own infrastructure would remove our
>> > dependency on Apache's Travis account (with the obvious downside of having
>> > to maintain the infrastructure)
>> > We could use an open source travis alternative, to have a similar
>> > experience and make the migration easy.
>> >
>> >
>> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[email protected]> 
>> > wrote:
>> >
>> >> From what I gathered, there's no special sauce that the Zeppelin
>> >> project uses which actually integrates a user's Travis account into the PR.
>> >>
>> >> They just disabled Travis for PRs. And that's kind of it.
>> >>
>> >> Naturally we can do this (duh) and save the ASF a fair amount of
>> >> resources, but there are downsides:
>> >>
>> >> The discoverability of the Travis check takes a nose-dive. Either we
>> >> require every contributor to always, on every commit, also post a Travis
>> >> build, or we have the reviewer sift through the contributor's account to
>> >> find it.
>> >>
>> >> This is rather cumbersome. Additionally, it's also not equivalent to
>> >> having a PR build.
>> >>
>> >> A normal branch build takes a branch as is and tests it. A PR build
>> >> merges the branch into master, and then runs it. (Fun fact: This is why
>> >> a PR with merge conflicts is not being run on Travis.)
>> >>
>> >> And ultimately, everyone can already make use of this approach anyway.
>> >>
>> >> On 25/06/2019 08:02, Jark Wu wrote:
>> >>> Hi Jeff,
>> >>>
>> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>> >>> leverage users' Travis accounts.
>> >>> In this way, we can have almost unlimited concurrent build jobs, and
>> >>> developers can restart builds themselves (currently only committers
>> >>> can restart a PR's build).
>> >>>
>> >>> But it's still not very clear to me how to integrate a user's Travis
>> >>> build into the Flink pull request's build automatically. Can you explain
>> >>> in more detail?
>> >>>
>> >>> Another question: does Travis only build branches for user accounts?
>> >>> My concern is that PR builds rebase the user's commits onto the
>> >>> current master branch, which helps us find problems before merging,
>> >>> whereas branch builds miss the impact of new commits in master.
>> >>> How does Zeppelin solve this problem?
>> >>>
>> >>> Thanks again for sharing the idea.
>> >>>
>> >>> Regards,
>> >>> Jark
>> >>>
>> >>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[email protected]> wrote:
>> >>>
>> >>>     Hi Folks,
>> >>>
>> >>>     Zeppelin met this kind of issue before. We solved it by delegating
>> >>>     each contributor's PR build to their own Travis account (everyone
>> >>>     can have 5 free slots for Travis builds).
>> >>>     The Apache account's Travis build is only triggered when a PR is
>> >>>     merged.
>> >>>
>> >>>
>> >>>
>> >>>     Kurt Young <[email protected]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
>> >>>
>> >>>     > (Forgot to cc George)
>> >>>     >
>> >>>     > Best,
>> >>>     > Kurt
>> >>>     >
>> >>>     >
>> >>>     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[email protected]> wrote:
>> >>>     >
>> >>>     > > Hi Bowen,
>> >>>     > >
>> >>>     > > Thanks for bringing this up. We have actually discussed this
>> >>>     > > before, and I think Till and George have already spent some time
>> >>>     > > investigating it. I have cc'ed both of them, and maybe they can
>> >>>     > > share their findings.
>> >>>     > >
>> >>>     > > Best,
>> >>>     > > Kurt
>> >>>     > >
>> >>>     > >
>> >>>     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[email protected]> wrote:
>> >>>     > >
>> >>>     > >> Hi Bowen,
>> >>>     > >>
>> >>>     > >> Thanks for bringing this up. We have also suffered from the long
>> >>>     > >> build times. I agree that we should focus on solving the build
>> >>>     > >> capacity problem in this thread.
>> >>>     > >>
>> >>>     > >> My observation is that only one build is running while all the
>> >>>     > >> others (other PRs, master) are pending. The Travis pricing
>> >>>     > >> plans [1] show that it can support concurrent build jobs, but I
>> >>>     > >> don't know which plan we are using; it might be the free plan
>> >>>     > >> for open source.
>> >>>     > >>
>> >>>     > >> I cc-ed Chesnay who may have some experience on Travis.
>> >>>     > >>
>> >>>     > >> Regards,
>> >>>     > >> Jark
>> >>>     > >>
>> >>>     > >> [1]: https://travis-ci.com/plans
>> >>>     > >>
>> >>>     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[email protected]> wrote:
>> >>>     > >>
>> >>>     > >> > Hi Steven,
>> >>>     > >> >
>> >>>     > >> > I think you may not have read what I wrote. The discussion is
>> >>>     > >> > about "unstable build **capacity**", in other words "unstable /
>> >>>     > >> > lack of build resources", not "unstable build".
>> >>>     > >> >
>> >>>     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[email protected]> wrote:
>> >>>     > >> >
>> >>>     > >> > > Long and sometimes unstable builds are definitely a pain point.
>> >>>     > >> > >
>> >>>     > >> > > I suspect the build failure here in flink-connector-kafka is
>> >>>     > >> > > not related to my change, but there is no easy way to re-run
>> >>>     > >> > > the build in the Travis UI. A Google search turned up a trick:
>> >>>     > >> > > closing and reopening the PR triggers a rebuild. But that
>> >>>     > >> > > could add noise to the PR activity.
>> >>>     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
>> >>>     > >> > >
>> >>>     > >> > > travis-ci for my personal repo has often failed by exceeding
>> >>>     > >> > > the time limit after 4+ hours:
>> >>>     > >> > > "The job exceeded the maximum time limit for jobs, and has
>> >>>     > >> > > been terminated."
>> >>>     > >> > >
>> >>>     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[email protected]> wrote:
>> >>>     > >> > >
>> >>>     > >> > > > https://travis-ci.org/apache/flink/builds/549681530
>> >>>     > >> > > > This build request has been sitting at the **HEAD of the
>> >>>     > >> > > > queue** since I first saw it at 10:30am PST (not sure how
>> >>>     > >> > > > long it had been there before then). It's 4:12pm PST now
>> >>>     > >> > > > and it hasn't started yet.
>> >>>     > >> > > >
>> >>>     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[email protected]> wrote:
>> >>>     > >> > > >
>> >>>     > >> > > > > Hi devs,
>> >>>     > >> > > > >
>> >>>     > >> > > > > I've been experiencing the pain resulting from the lack of
>> >>>     > >> > > > > stable build capacity on Travis for Flink PRs [1].
>> >>>     > >> > > > > Specifically, I have often noticed that no build in the
>> >>>     > >> > > > > queue makes any progress for hours, and then suddenly 5 or
>> >>>     > >> > > > > 6 builds kick off all together after the long pause. I'm in
>> >>>     > >> > > > > the PST (UTC-08) time zone, and I've seen the pause last as
>> >>>     > >> > > > > long as 6 hours, from 9am to 3pm PST (let alone the time
>> >>>     > >> > > > > needed to drain the queue afterwards).
>> >>>     > >> > > > >
>> >>>     > >> > > > > I think this has greatly impacted our productivity. I've
>> >>>     > >> > > > > seen PRs submitted in the early morning PST not finish
>> >>>     > >> > > > > their builds until late at night the same day.
>> >>>     > >> > > > >
>> >>>     > >> > > > > So my questions are:
>> >>>     > >> > > > >
>> >>>     > >> > > > > - Has anyone else experienced the same problem or made
>> >>>     > >> > > > > similar observations on TravisCI? (I suspect it has
>> >>>     > >> > > > > something to do with time zones.)
>> >>>     > >> > > > >
>> >>>     > >> > > > > - Which TravisCI pricing plan is Flink currently using? Is
>> >>>     > >> > > > > it the free plan for open source projects? What build
>> >>>     > >> > > > > capacity does the current plan guarantee?
>> >>>     > >> > > > >
>> >>>     > >> > > > > - If the current pricing plan (either free or paid) can't
>> >>>     > >> > > > > provide stable build capacity, can we upgrade to a
>> >>>     > >> > > > > higher-priced plan with larger and more stable build
>> >>>     > >> > > > > capacity?
>> >>>     > >> > > > >
>> >>>     > >> > > > > BTW, another factor that contributes to the productivity
>> >>>     > >> > > > > problem is that our build is slow: we run a full build for
>> >>>     > >> > > > > every PR, and a successful full build takes ~5h. We
>> >>>     > >> > > > > definitely have more options to solve it, for instance,
>> >>>     > >> > > > > modularizing the build graph and reusing artifacts from the
>> >>>     > >> > > > > previous build. But I think that can be a big effort which
>> >>>     > >> > > > > is much harder to accomplish in a short period of time and
>> >>>     > >> > > > > may deserve its own separate discussion.
>> >>>     > >> > > > >
>> >>>     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>> >>>     > >> > > > >
>> >>>     > >> > > > >
>> >>>     > >> > > >
>> >>>     > >> > >
>> >>>     > >> >
>> >>>     > >>
>> >>>     > >
>> >>>     >
>> >>>
>> >>>
>> >>>     --
>> >>>     Best Regards
>> >>>
>> >>>     Jeff Zhang
>> >>>
>> >>
>>
