Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler Thu, 27 Jun 2019 00:31:15 -0700

See https://issues.apache.org/jira/browse/INFRA-18533 for the overall degradation of Travis capacity.


On 26/06/2019 21:50, Bowen wrote:

Just to elaborate a bit more on why a slow build is OK but no resources is not:
say I submit a build request at 9am PST, no other requests exist, and mine is at
the head of the queue. Currently it still won't get built until 4 or 5pm.

On Jun 26, 2019, at 12:28, Bowen Li <[email protected]> wrote:

Hi,

@Dawid, I think the "long test running" as I mentioned in the first email, also as you
guys said, belongs to "a big effort which is much harder to accomplish in a short period of
time and may deserve its own separate discussion". Thus I didn't include it in what we can do
in a foreseeable short term.

Besides, I don't think that's the ultimate reason for the lack of build
resources. Even if the build is shortened to something like 2h, the problem I
described, of no build machine working for 6 or more hours during PST daytime,
will still happen, because no machine from ASF INFRA's pool is allocated to
Flink. I have paid close attention to the build queue over the past few
weekdays, and it's a pretty clear pattern now.

**The ultimate root cause** for that is that we don't have any **dedicated**
build resources that we can stably rely on. I'm actually ok with waiting a long
time if there are build requests running, since it means we are at least making
progress. But I'm not ok with having no build resources at all. What I think we
should aim for in the short term is to always have at least a central pool (can
be 3 or 5) of machines dedicated to building Flink at any time, or maybe to use
users' resources.

@Chesnay @Robert I synced with Jeff offline and learned that the Zeppelin
community is using a Jenkins job to automatically build on users' Travis
accounts and link the result back to the GitHub PR. I guess the Jenkins job
fetches the latest upstream master and builds the PR against it. Jeff has filed
tickets to learn about and get access to the Jenkins infra. It'll be better to
fully understand it first before judging this approach.

I also heard good things about CircleCI, and ASF INFRA seems to have a pool of
build capacity there too. Can be an alternative to consider.

On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[email protected]> 
wrote:
Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.
This makes the user's builds simply fail, until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits)

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:

Do we know if using "the best" available hardware would improve the build
times?
Imagine we would run the build on machines with plenty of main memory to
mount everything to ramdisk + the latest CPU architecture?

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure)
We could use an open source travis alternative, to have a similar
experience and make the migration easy.


On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[email protected]> wrote:

  From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a user's Travis account into the PR.

They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and save the ASF a fair amount of
resources, but there are downsides:

The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, on every commit, also post a Travis
build, or we have the reviewer sift through the contributor's account to
find it.

This is rather cumbersome. Additionally, it's also not equivalent to
having a PR build.

A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not being run on Travis.)
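
To illustrate the difference between the two build types, here is a toy
Python model. It is an illustrative sketch only, not Travis or git code; the
names merge, branch_build_input and pr_build_input are made up for this
example, and a "tree" is simply a dict of file name to content. A branch build
tests the branch tree as is, while a PR build tests the result of merging the
branch into current master and has nothing to test when that merge conflicts.

```python
# Toy model only: a "tree" maps file name -> content (hypothetical data model).

def merge(base, master, branch):
    """Three-way merge of two trees against their common base.

    Returns the merged tree, or None when both sides changed the same
    file in different ways (a merge conflict).
    """
    merged = {}
    for path in set(base) | set(master) | set(branch):
        b, m, p = base.get(path), master.get(path), branch.get(path)
        if m == p:
            result = m        # both sides agree on this file
        elif m == b:
            result = p        # only the branch changed this file
        elif p == b:
            result = m        # only master changed this file
        else:
            return None       # both changed it differently: conflict
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged


def branch_build_input(branch):
    """A branch build tests the branch exactly as it is."""
    return branch


def pr_build_input(base, master, branch):
    """A PR build tests branch merged into master (None on conflict)."""
    return merge(base, master, branch)
```

In this model a change that landed on master after the branch was created
shows up in the PR build input but not in the branch build input, which is the
gap between the two build types described above.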

And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:

Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage user's travis account.
In this way, we can have almost unlimited concurrent build jobs and
developers can restart build by themselves (currently only committers
can restart PR's build).

But I'm still not very clear how to integrate user's travis build into
the Flink pull request's build automatically. Can you explain more in
detail?

Another question: does Travis only build branches for user accounts?
My concern is that builds for PRs combine the user's commits with the
current master branch, which helps us find problems before merging.
Builds for branches will miss the impact of new commits in master.
How does Zeppelin solve this problem?

Thanks again for sharing the idea.

Regards,
Jark

On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[email protected]
<mailto:[email protected]>> wrote:

     Hi Folks,

     Zeppelin met this kind of issue before. We solved it by delegating each
     contributor's PR build to their own Travis account (everyone can have 5
     free slots for Travis builds).
     The Apache account's Travis build is only triggered when a PR is merged.



     Kurt Young <[email protected] <mailto:[email protected]>>
     wrote on Tue, Jun 25, 2019 at 10:16 AM:

     > (Forgot to cc George)
     >
     > Best,
     > Kurt
     >
     >
     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[email protected]
     <mailto:[email protected]>> wrote:
     >
     > > Hi Bowen,
     > >
     > > Thanks for bringing this up. We have actually discussed this, and I
     > > think Till and George have already spent some time investigating it.
     > > I have cc'ed both of them, and maybe they can share their findings.
     > >
     > > Best,
     > > Kurt
     > >
     > >
     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[email protected]
     <mailto:[email protected]>> wrote:
     > >
     > >> Hi Bowen,
     > >>
     > >> Thanks for bringing this up. We also suffered from the long build time.
     > >> I agree that we should focus on solving the build capacity problem in
     > >> this thread.
     > >>
     > >> My observation is that only one build is running while all the others
     > >> (other PRs, master) are pending.
     > >> The pricing plan [1] of Travis shows it can support concurrent build
     > >> jobs. But I don't know which plan we are using; it might be the free
     > >> plan for open source.
     > >>
     > >> I cc-ed Chesnay who may have some experience on Travis.
     > >>
     > >> Regards,
     > >> Jark
     > >>
     > >> [1]: https://travis-ci.com/plans
     > >>
     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[email protected]
     <mailto:[email protected]>> wrote:
     > >>
     > >> > Hi Steven,
     > >> >
     > >> > I think you may not have read what I wrote. The discussion is about
     > >> > "unstable build **capacity**", in other words "unstable / lack of
     > >> > build resources", not "unstable build".
     > >> >
     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu
     <[email protected] <mailto:[email protected]>>
     > wrote:
     > >> >
     > >> > > A long and sometimes unstable build is definitely a pain point.
     > >> > >
     > >> > > I suspect the build failure here in flink-connector-kafka is not
     > >> > > related to my change, but there is no easy way to re-run the build
     > >> > > in the Travis UI. A Google search turned up a trick: closing and
     > >> > > reopening the PR triggers a rebuild, but that adds noise to the PR
     > >> > > activity.
     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
     > >> > >
     > >> > > travis-ci for my personal repo often failed by exceeding the time
     > >> > > limit after 4+ hours:
     > >> > > The job exceeded the maximum time limit for jobs, and has been
     > >> > > terminated.
     > >> > >
     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li
     <[email protected] <mailto:[email protected]>>
     > wrote:
     > >> > >
     > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build
     > >> > > > request has been sitting at **HEAD of the queue** since I first
     > >> > > > saw it at PST 10:30am (not sure how long it had been there before
     > >> > > > 10:30am). It's PST 4:12pm now and it hasn't started yet.
     > >> > > >
     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li
     <[email protected] <mailto:[email protected]>>
     > >> wrote:
     > >> > > >
     > >> > > > > Hi devs,
     > >> > > > >
     > >> > > > > I've been experiencing the pain resulting from the lack of
     > >> > > > > stable build capacity on Travis for Flink PRs [1]. Specifically,
     > >> > > > > I have often noticed that no build in the queue makes any
     > >> > > > > progress for hours, and then suddenly 5 or 6 builds kick off all
     > >> > > > > together after the long pause. I'm in the PST (UTC-08) time
     > >> > > > > zone, and I've seen the pause last as long as 6 hours, from PST
     > >> > > > > 9am to 3pm (let alone the time needed to drain the queue
     > >> > > > > afterwards).
     > >> > > > >
     > >> > > > > I think this has greatly impacted our productivity. I've
     > >> > > > > experienced that PRs submitted in the early morning of the PST
     > >> > > > > time zone won't finish their build until late night of the same
     > >> > > > > day.
     > >> > > > >
     > >> > > > > So my questions are:
     > >> > > > >
     > >> > > > > - Has anyone else experienced the same problem or made a similar
     > >> > > > > observation on TravisCI? (I suspect it has something to do with
     > >> > > > > the time zone)
     > >> > > > >
     > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it
     > >> > > > > the free plan for open source projects? What is the guaranteed
     > >> > > > > build capacity of the current plan?
     > >> > > > >
     > >> > > > > - If the current pricing plan (either free or paid) can't provide
     > >> > > > > stable build capacity, can we upgrade to a higher priced plan
     > >> > > > > with larger and more stable build capacity?
     > >> > > > >
     > >> > > > > BTW, another factor that contributes to the productivity problem
     > >> > > > > is that our build is slow - we run a full build for every PR and
     > >> > > > > a successful full build takes ~5h. We definitely have more
     > >> > > > > options to solve it, for instance, modularizing the build graph
     > >> > > > > and reusing artifacts from the previous build. But I think that
     > >> > > > > can be a big effort which is much harder to accomplish in a
     > >> > > > > short period of time and may deserve its own separate
     > >> > > > > discussion.
     > >> > > > >
     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
     > >> > > > >
     > >> > > > >
     > >> > > >
     > >> > >
     > >> >
     > >>
     > >
     >


     --
     Best Regards

     Jeff Zhang
