See https://issues.apache.org/jira/browse/INFRA-18533 for the overall degradation of Travis capacity.

On 26/06/2019 21:50, Bowen wrote:
Just to elaborate a bit more on why a slow build is OK but having no resources is
not: say I submit a build request at PST 9am, no other requests exist, and mine is
at the head of the queue. Currently it still cannot get built until 4 or 5pm.



On Jun 26, 2019, at 12:28, Bowen Li <bowenl...@gmail.com> wrote:

Hi,

@Dawid, I think the "long test running" issue that I mentioned in the first email, and that you 
guys also raised, belongs to "a big effort which is much harder to accomplish in a short period of 
time and may deserve its own separate discussion". Thus I didn't include it in what we can do 
in the foreseeable short term.

Besides, I don't think that's the ultimate reason for the lack of build resources. 
Even if the build is shortened to something like 2h, the problem I described, 
where no build machine works for 6 or more hours during PST daytime, will still 
happen, because no machine from ASF INFRA's pool is allocated to Flink. I 
have paid close attention to the build queue over the past few weekdays, and it's a 
pretty clear pattern now.
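
To make the arithmetic concrete, here is a minimal sketch. The 9am submission, the 2h and ~5h build durations, and the "no machine until about 4pm" stall are taken from this thread; the exact 7-hour stall and the helper name `finish_time` are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def finish_time(submitted, stall_hours, build_hours):
    """Return when a build finishes if it first waits stall_hours for a machine."""
    return submitted + timedelta(hours=stall_hours + build_hours)


# A request submitted at 9am that is already at the head of the queue.
submitted = datetime(2019, 6, 26, 9, 0)

# Assumption: no machine is allocated until about 4pm, i.e. a 7-hour stall.
stall = 7

slow_after_stall = finish_time(submitted, stall, 5)  # ~5h full build
fast_after_stall = finish_time(submitted, stall, 2)  # build shortened to ~2h
slow_no_stall = finish_time(submitted, 0, 5)         # ~5h build, machine available

for label, t in [
    ("5h build after stall", slow_after_stall),
    ("2h build after stall", fast_after_stall),
    ("5h build, no stall", slow_no_stall),
]:
    print(f"{label}: finishes at {t:%H:%M}")
```

Under these assumptions, even the unshortened 5h build with a machine available finishes earlier than the 2h build that has to wait out the stall.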

**The ultimate root cause** for that is that we don't have any **dedicated** build 
resources that we can stably rely on. I'm actually OK with waiting for a long time 
if there are build requests running, since it means at least we are making progress. 
But I'm not OK with having no build resources. In the short term, I think we should 
aim to always have at least a central pool (say 3 or 5) of machines dedicated to 
building Flink at any time, or maybe use users' resources.

@Chesnay @Robert I synced with Jeff offline: the Zeppelin community is using a 
Jenkins job to automatically build on users' Travis accounts and link the result 
back to the GitHub PR. I guess the Jenkins job fetches the latest upstream master 
and builds the PR against it. Jeff has filed tickets to learn about and get access to 
the Jenkins infra. It'll be better to fully understand it first before judging 
this approach.

I also heard good things about CircleCI, and ASF INFRA seems to have a pool of 
build capacity there too. It could be an alternative to consider.

On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org> 
wrote:
Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.
This makes the user's builds simply fail, until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits)

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:
Do we know if using "the best" available hardware would improve the build
times?
Imagine we would run the build on machines with plenty of main memory to
mount everything to ramdisk + the latest CPU architecture?

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure)
We could use an open source travis alternative, to have a similar
experience and make the migration easy.


On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org> wrote:

From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a user's Travis account into the PR.

They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and save the ASF a fair amount of
resources, but there are downsides:

The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, on every commit, also post a Travis
build, or we have the reviewer sift through the contributor's account to
find it.

This is rather cumbersome. Additionally, it's also not equivalent to
having a PR build.

A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not being run on Travis.)

And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:
Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage user's travis account.
In this way, we can have almost unlimited concurrent build jobs and
developers can restart build by themselves (currently only committers
can restart PR's build).

But I'm still not very clear how to integrate user's travis build into
the Flink pull request's build automatically. Can you explain more in
detail?

Another question: does Travis only build branches for user accounts?
My concern is that builds for PRs rebase the user's commits against the
current master branch, which helps us find problems before merging. Builds
for branches lose the impact of new commits in master.
How does Zeppelin solve this problem?

Thanks again for sharing the idea.

Regards,
Jark

On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:

     Hi Folks,

     Zeppelin met this kind of issue before. We solved it by delegating each
     contributor's PR build to their own Travis account (everyone gets 5 free
     slots for Travis builds).
     The Apache account's Travis build is only triggered when a PR is merged.



     Kurt Young <ykt...@gmail.com> wrote on Tue, Jun 25, 2019 at 10:16 AM:

     > (Forgot to cc George)
     >
     > Best,
     > Kurt
     >
     >
     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
     >
     > > Hi Bowen,
     > >
     > > Thanks for bringing this up. We have actually discussed this, and I
     > > think Till and George have already spent some time investigating it.
     > > I have cc'ed both of them, and maybe they can share their findings.
     > >
     > > Best,
     > > Kurt
     > >
     > >
     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
     > >
     > >> Hi Bowen,
     > >>
     > >> Thanks for bringing this up. We also suffered from the long build time.
     > >> I agree that we should focus on solving the build capacity problem in
     > >> this thread.
     > >>
     > >> My observation is that only one build is running, and all the others
     > >> (other PRs, master) are pending.
     > >> The pricing plan [1] of Travis shows it can support concurrent build
     > >> jobs. But I don't know which plan we are using; it might be the free
     > >> plan for open source.
     > >>
     > >> I cc-ed Chesnay who may have some experience on Travis.
     > >>
     > >> Regards,
     > >> Jark
     > >>
     > >> [1]: https://travis-ci.com/plans
     > >>
     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
     > >>
     > >> > Hi Steven,
     > >> >
     > >> > I think you may not have read what I wrote. The discussion is about
     > >> > "unstable build **capacity**", in other words "unstable / lack of
     > >> > build resources", not "unstable build".
     > >> >
     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
     > >> >
     > >> > > A long and sometimes unstable build is definitely a pain point.
     > >> > >
     > >> > > I suspect the build failure here in flink-connector-kafka is not
     > >> > > related to my change, but there is no easy way to re-run the build
     > >> > > in the Travis UI. A Google search showed a trick: closing and
     > >> > > reopening the PR will trigger a rebuild. But that could add noise
     > >> > > to the PR activity.
     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
     > >> > >
     > >> > > travis-ci for my personal repo often failed by exceeding the time
     > >> > > limit after 4+ hours:
     > >> > > The job exceeded the maximum time limit for jobs, and has been
     > >> > > terminated.
     > >> > >
     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
     > >> > >
     > >> > > > https://travis-ci.org/apache/flink/builds/549681530
     > >> > > > This build request has been sitting at **HEAD of the queue** since
     > >> > > > I first saw it at PST 10:30am (not sure how long it had been there
     > >> > > > before 10:30am). It's PST 4:12pm now and it hasn't started yet.
     > >> > > >
     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
     > >> > > >
     > >> > > > > Hi devs,
     > >> > > > >
     > >> > > > > I've been experiencing the pain resulting from the lack of stable
     > >> > > > > build capacity on Travis for Flink PRs [1]. Specifically, I have
     > >> > > > > often noticed that no build in the queue makes any progress for
     > >> > > > > hours, and then suddenly 5 or 6 builds kick off all together
     > >> > > > > after the long pause. I'm in the PST (UTC-08) time zone, and I've
     > >> > > > > seen the pause last as long as 6 hours, from PST 9am to 3pm (let
     > >> > > > > alone the time needed to drain the queue afterwards).
     > >> > > > >
     > >> > > > > I think this has greatly impacted our productivity. I've
     > >> > > > > experienced that PRs submitted in the early morning PST won't
     > >> > > > > finish their build until late at night the same day.
     > >> > > > >
     > >> > > > > So my questions are:
     > >> > > > >
     > >> > > > > - Has anyone else experienced the same problem or made a similar
     > >> > > > > observation on TravisCI? (I suspect it has something to do with
     > >> > > > > the time zone)
     > >> > > > >
     > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it
     > >> > > > > the free plan for open source projects? What is the guaranteed
     > >> > > > > build capacity of the current plan?
     > >> > > > >
     > >> > > > > - If the current pricing plan (either free or paid) can't provide
     > >> > > > > stable build capacity, can we upgrade to a higher priced plan
     > >> > > > > with larger and more stable build capacity?
     > >> > > > >
     > >> > > > > BTW, another factor that contributes to the productivity problem
     > >> > > > > is that our build is slow - we run a full build for every PR and
     > >> > > > > a successful full build takes ~5h. We definitely have more
     > >> > > > > options to solve it, for instance, modularizing the build graph
     > >> > > > > and reusing artifacts from the previous build. But I think that
     > >> > > > > can be a big effort which is much harder to accomplish in a short
     > >> > > > > period of time and may deserve its own separate discussion.
     > >> > > > >
     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
     > >> > > > >
     > >> > > > >
     > >> > > >
     > >> > >
     > >> >
     > >>
     > >
     >


     --
     Best Regards

     Jeff Zhang

