Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Bowen Li Mon, 01 Jul 2019 20:13:09 -0700

Looking at the git history of the Jenkins script, its core part was
finished in March 2017 (with only two minor updates in 2017/2018), so it has
been running for over two years now, and it seems the Zeppelin community has
been quite happy with it. @Jeff Zhang <zjf...@gmail.com> can you share your
insights and user experience with the Jenkins+Travis approach?


Things like:

- Has the approach completely solved the resource capacity problem for the
Zeppelin community? Is the Zeppelin community happy with the result?
- Is the whole configuration chain stable enough (e.g. in terms of uptime)?
- How often do you need to maintain the Jenkins infra? How many people are
usually involved in maintenance and bug fixes?

To me, the downside of this approach seems to be mostly the maintenance:
maintaining the script and the Jenkins infra.
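
To make the Jenkins side of this concrete, here is a minimal sketch of the
polling pattern discussed in this thread: a job repeatedly fetches the Travis
build state until it reaches a final state. It is not the actual
travis_check.py. The function name, the fetch_state callback, the state
names and the default intervals are all assumptions made for illustration;
only the meaning of exit code 2 (no build information found) follows the
Jenkins script quoted below.

```python
import time

# Assumed set of final build states; adjust to whatever the CI API reports.
TERMINAL_STATES = {"passed", "failed", "errored", "canceled"}


def wait_for_build(fetch_state, poll_interval=300, timeout=5 * 3600,
                   sleep=time.sleep):
    """Poll fetch_state() until the build finishes or the timeout elapses.

    fetch_state is a hypothetical callback that returns the current build
    state as a string, or None when no build can be found for the commit.
    Returns 0 if the build passed, 2 if no build was found, 1 otherwise
    (failed build or timeout).
    """
    waited = 0
    while waited <= timeout:
        state = fetch_state()
        if state is None:
            return 2  # no build information, e.g. Travis not enabled on the fork
        if state in TERMINAL_STATES:
            return 0 if state == "passed" else 1
        # The worker does nothing but wait here, which is the idle-time cost.
        sleep(poll_interval)
        waited += poll_interval
    return 1  # gave up waiting


if __name__ == "__main__":
    # Dry run with canned states and no real sleeping.
    fake_states = iter(["created", "started", "passed"])
    print(wait_for_build(lambda: next(fake_states), sleep=lambda seconds: None))
```

The sleep call is where a Jenkins worker would sit idle for the duration of
the Travis build, which is the cost raised further down in the thread.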

** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own travis-ci.com
account with paid dedicated resources. Note that travis-ci.org is the free
version and travis-ci.com is the commercial version. We currently use a
shared resource pool managed by the ASF INFRA team on travis-ci.org, but we
have no control over it: we can't see how it's configured, how many
resources are available, how resources are allocated among Apache projects,
etc. The nice things about having an account on travis-ci.com are:

- relatively low cost with a much better resource guarantee than what we
currently have [1]: $249/month for 5 dedicated concurrent jobs, $489/month
for 10 concurrent jobs
- low maintenance work compared to using Jenkins
- (potentially) no migration cost according to Travis's doc [2] (pending
verification)
- full control over the build capacity/configuration compared to using ASF
INFRA's pool

I'd be surprised if we, as such a vibrant community, cannot find and fund
$249*12=$2988 a year in exchange for a much better developer experience and
much higher productivity.
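
As a quick sanity check of the numbers above, the following sketch computes
the yearly cost and the monthly cost per concurrent job for the two plans
quoted from [1]. The prices are the ones stated in this mail; the helper
names are made up for illustration.

```python
# Monthly prices in USD as quoted above: concurrent jobs -> price per month.
PLANS = {5: 249, 10: 489}


def annual_cost(monthly_usd):
    """Yearly cost of a plan billed monthly."""
    return monthly_usd * 12


def monthly_cost_per_slot(concurrency, monthly_usd):
    """Monthly price divided by the number of concurrent jobs."""
    return monthly_usd / concurrency


for concurrency, monthly in sorted(PLANS.items()):
    print(f"{concurrency} concurrent jobs: ${annual_cost(monthly)}/year, "
          f"${monthly_cost_per_slot(concurrency, monthly):.2f}/month per slot")
```

Under these prices the per-slot cost is nearly the same for both plans, so
the choice mainly comes down to how many builds need to run at once.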

[1] https://travis-ci.com/plans
[2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration

On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:

> So yes, the Jenkins job keeps pulling the state from Travis until it
> finishes.
>
> Not sure I'm comfortable with the idea of using Jenkins workers just to
> idle for several hours.
>
> On 29/06/2019 14:56, Jeff Zhang wrote:
> > Here's what the Zeppelin community did: we wrote a Python script to check
> > the build status of a pull request.
> > Here's the script:
> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >
> > And this is the script we use in the Jenkins build job.
> >
> > if [ -f "travis_check.py" ]; then
> >    git log -n 1
> >    # extract the PR URL and the author's URL from the Jenkins build page
> >    STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
> >      sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >    AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >    PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >    #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >    #if [ -z $COMMIT ]; then
> >    #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >    #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >    #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> >    #    grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >    #fi
> >
> >    # get commit hash from PR
> >    COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >      grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >      sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> >      grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >    sleep 30 # sleep a few moments to wait until travis starts the build
> >    RET_CODE=0
> >    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >    if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> >      RET_CODE=0
> >      AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >        grep '"full_name":' | grep -v "apache/zeppelin" |
> >        sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >      python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >    fi
> >
> >    if [ $RET_CODE -eq 2 ]; then # fail with can't find build information in the travis
> >      set +x
> >      echo "-----------------------------------------------------"
> >      echo "Looks like travis-ci is not configured for your fork."
> >      echo "Please setup by switching on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >      echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings.";
> >      echo ""
> >      echo "To trigger CI after setup, you will need to amend your last commit with"
> >      echo "git commit --amend"
> >      echo "git push your-remote HEAD --force"
> >      echo ""
> >      echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
> >    fi
> >
> >    exit $RET_CODE
> > else
> >    set +x
> >    echo "travis_check.py does not exist"
> >    exit 1
> > fi
> >
> > Chesnay Schepler <ches...@apache.org> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> >
> >> Does this imply that a Jenkins job is active as long as the Travis build
> >> runs?
> >>
> >> On 26/06/2019 21:28, Bowen Li wrote:
> >>> Hi,
> >>>
> >>> @Dawid, I think the "long test running" issue I mentioned in the first
> >>> email, as you also said, belongs to "a big effort which is much harder to
> >>> accomplish in a short period of time and may deserve its own separate
> >>> discussion". Thus I didn't include it in what we can do in the
> >>> foreseeable short term.
> >>>
> >>> Besides, I don't think that's the ultimate reason for the lack of build
> >>> resources. Even if the build is shortened to something like 2h, the
> >>> problem I described, of no build machine working for 6 or more hours
> >>> during PST daytime, will still happen, because no machine from ASF
> >>> INFRA's pool is allocated to Flink. As I have paid close attention to
> >>> the build queue over the past few weekdays, it's a pretty clear pattern
> >>> now.
> >>>
> >>> **The ultimate root cause** for that is that we don't have any
> >>> **dedicated** build resources that we can stably rely on. I'm actually
> >>> ok with waiting a long time if there are build requests running; it
> >>> means at least we are making progress. But I'm not ok with having no
> >>> build resources. A better place I think we should aim for in the short
> >>> term is to always have at least a central pool (can be 3 or 5) of
> >>> machines dedicated to building Flink at any time, or maybe use users'
> >>> resources.
> >>>
> >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
> >>> using a Jenkins job to automatically build on users' Travis accounts and
> >>> link the result back to the GitHub PR. I guess the Jenkins job would
> >>> fetch the latest upstream master and build the PR against it. Jeff has
> >>> filed tickets to learn about and get access to the Jenkins infra. It'll
> >>> be better to fully understand it first before judging this approach.
> >>>
> >>> I also heard good things about CircleCI, and ASF INFRA seems to have a
> >>> pool of build capacity there too. It can be an alternative to consider.
> >>>
> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>
> >>>> Sorry to jump in late, but I think Bowen missed the most important point
> >>>> from Chesnay's previous message in the summary. The ultimate reason for
> >>>> all the problems is that the tests take close to 2 hours to run already.
> >>>> I fully support this claim: "Unless people start caring about test times
> >>>> before adding them, this issue cannot be solved"
> >>>>
> >>>> This is also another reason why using users' Travis accounts won't help.
> >>>> Every few weeks we reach the user's time limit for a single profile.
> >>>> This makes the user's builds simply fail, until we either properly
> >>>> decrease the time the tests take (which I am not sure we ever did) or
> >>>> postpone the problem by splitting into more profiles. (Note that the ASF
> >>>> Travis account has higher time limits)
> >>>>
> >>>> Best,
> >>>>
> >>>> Dawid
> >>>>
> >>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>>> Do we know if using "the best" available hardware would improve the
> >>>>> build times?
> >>>>> Imagine we ran the build on machines with plenty of main memory to
> >>>>> mount everything on a ramdisk, plus the latest CPU architecture.
> >>>>>
> >>>>> Throwing hardware at the problem could help reduce the time of an
> >>>>> individual build, and using our own infrastructure would remove our
> >>>>> dependency on Apache's Travis account (with the obvious downside of
> >>>>> having to maintain the infrastructure).
> >>>>> We could use an open source Travis alternative, to have a similar
> >>>>> experience and make the migration easy.
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>>>>> From what I gathered, there's no special sauce that the Zeppelin
> >>>>>> project uses which actually integrates a user's Travis account into
> >>>>>> the PR.
> >>>>>> They just disabled Travis for PRs. And that's kind of it.
> >>>>>>
> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
> >>>>>> resources, but there are downsides:
> >>>>>>
> >>>>>> The discoverability of the Travis check takes a nose-dive. Either we
> >>>>>> require every contributor to always, on every commit, also post a
> >>>>>> Travis build, or we have the reviewer sift through the contributor's
> >>>>>> account to find it.
> >>>>>>
> >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to
> >>>>>> having a PR build.
> >>>>>>
> >>>>>> A normal branch build takes a branch as is and tests it. A PR build
> >>>>>> merges the branch into master, and then runs it. (Fun fact: This is
> >>>>>> why a PR with merge conflicts is not being run on Travis.)
> >>>>>>
> >>>>>> And ultimately, everyone can already make use of this approach anyway.
> >>>>>>
> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>>>>> Hi Jeff,
> >>>>>>>
> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>>>>>> leverage users' Travis accounts.
> >>>>>>> In this way, we can have almost unlimited concurrent build jobs and
> >>>>>>> developers can restart builds by themselves (currently only committers
> >>>>>>> can restart a PR's build).
> >>>>>>>
> >>>>>>> But I'm still not very clear on how to integrate a user's Travis build
> >>>>>>> into the Flink pull request's build automatically. Can you explain in
> >>>>>>> more detail?
> >>>>>>>
> >>>>>>> Another question: does Travis only build branches for user accounts?
> >>>>>>> My concern is that builds for PRs will rebase the user's commits
> >>>>>>> against the current master branch.
> >>>>>>> This will help us to find problems before merge. Builds for branches
> >>>>>>> will lose the impact of new commits in master.
> >>>>>>> How does Zeppelin solve this problem?
> >>>>>>>
> >>>>>>> Thanks again for sharing the idea.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>       Hi Folks,
> >>>>>>>
> >>>>>>>       Zeppelin met this kind of issue before; we solved it by delegating
> >>>>>>>       each contributor's PR build to their own Travis account (everyone
> >>>>>>>       can have 5 free slots for Travis builds).
> >>>>>>>       The Apache account's Travis build is only triggered when a PR is
> >>>>>>>       merged.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>       Kurt Young <ykt...@gmail.com> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>>>>>
> >>>>>>>       > (Forgot to cc George)
> >>>>>>>       >
> >>>>>>>       > Best,
> >>>>>>>       > Kurt
> >>>>>>>       >
> >>>>>>>       >
> >>>>>>>       > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
> >>>>>>>       >
> >>>>>>>       > > Hi Bowen,
> >>>>>>>       > >
> >>>>>>>       > > Thanks for bringing this up. We actually have discussed this,
> >>>>>>>       > > and I think Till and George have already spent some time
> >>>>>>>       > > investigating it. I have cc'ed both of them, and maybe they
> >>>>>>>       > > can share their findings.
> >>>>>>>       > >
> >>>>>>>       > > Best,
> >>>>>>>       > > Kurt
> >>>>>>>       > >
> >>>>>>>       > >
> >>>>>>>       > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>       > >
> >>>>>>>       > >> Hi Bowen,
> >>>>>>>       > >>
> >>>>>>>       > >> Thanks for bringing this up. We also suffered from the long
> >>>>>>>       > >> build time.
> >>>>>>>       > >> I agree that we should focus on solving the build capacity
> >>>>>>>       > >> problem in this thread.
> >>>>>>>       > >>
> >>>>>>>       > >> My observation is that only one build is running, and all the
> >>>>>>>       > >> others (other PRs, master) are pending.
> >>>>>>>       > >> The pricing plan[1] of Travis shows it can support concurrent
> >>>>>>>       > >> build jobs.
> >>>>>>>       > >> But I don't know which plan we are using; it might be the free
> >>>>>>>       > >> plan for open source.
> >>>>>>>       > >>
> >>>>>>>       > >> I cc-ed Chesnay who may have some experience on Travis.
> >>>>>>>       > >>
> >>>>>>>       > >> Regards,
> >>>>>>>       > >> Jark
> >>>>>>>       > >>
> >>>>>>>       > >> [1]: https://travis-ci.com/plans
> >>>>>>>       > >>
> >>>>>>>       > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>>       > >>
> >>>>>>>       > >> > Hi Steven,
> >>>>>>>       > >> >
> >>>>>>>       > >> > I think you may not have read what I wrote. The discussion is
> >>>>>>>       > >> > about "unstable build **capacity**", in other words
> >>>>>>>       > >> > "unstable / lack of build resources", not "unstable build".
> >>>>>>>       > >> >
> >>>>>>>       > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
> >>>>>>>       > >> >
> >>>>>>>       > >> > > long and sometimes unstable builds are definitely a pain
> >>>>>>>       > >> > > point.
> >>>>>>>       > >> > >
> >>>>>>>       > >> > > I suspect the build failure here in flink-connector-kafka is
> >>>>>>>       > >> > > not related to my change, but there is no easy way to re-run
> >>>>>>>       > >> > > the build on the Travis UI. A Google search showed that a
> >>>>>>>       > >> > > trick of closing and reopening the PR will trigger a rebuild,
> >>>>>>>       > >> > > but that could add noise to the PR activities.
> >>>>>>>       > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>>>       > >> > >
> >>>>>>>       > >> > > travis-ci for my personal repo often failed with exceeding
> >>>>>>>       > >> > > the time limit after 4+ hours.
> >>>>>>>       > >> > > The job exceeded the maximum time limit for jobs, and has
> >>>>>>>       > >> > > been terminated.
> >>>>>>>       > >> > >
> >>>>>>>       > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>>       > >> > >
> >>>>>>>       > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>>>>>       > >> > > > This build request has been sitting at **HEAD of the queue**
> >>>>>>>       > >> > > > since I first saw it at PST 10:30am (not sure how long it's
> >>>>>>>       > >> > > > been there before 10:30am). It's PST 4:12pm now and it
> >>>>>>>       > >> > > > hasn't started yet.
> >>>>>>>       > >> > > >
> >>>>>>>       > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>>       > >> > > >
> >>>>>>>       > >> > > > > Hi devs,
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > I've been experiencing the pain resulting from the lack of
> >>>>>>>       > >> > > > > stable build capacity on Travis for Flink PRs [1].
> >>>>>>>       > >> > > > > Specifically, I often noticed that no build in the queue
> >>>>>>>       > >> > > > > makes any progress for hours, and then suddenly 5 or 6
> >>>>>>>       > >> > > > > builds kick off all together after the long pause. I'm in
> >>>>>>>       > >> > > > > the PST (UTC-08) time zone, and I've seen that the pause
> >>>>>>>       > >> > > > > can be as long as 6 hours, from PST 9am to 3pm (let alone
> >>>>>>>       > >> > > > > the time needed to drain the queue afterwards).
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > I think this has greatly impacted our productivity. I've
> >>>>>>>       > >> > > > > experienced that PRs submitted in the early morning of the
> >>>>>>>       > >> > > > > PST time zone won't finish their build until late night of
> >>>>>>>       > >> > > > > the same day.
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > So my questions are:
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > - Has anyone else experienced the same problem or made a
> >>>>>>>       > >> > > > > similar observation on TravisCI? (I suspect it has
> >>>>>>>       > >> > > > > something to do with the time zone)
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > - What pricing plan of TravisCI is Flink currently using?
> >>>>>>>       > >> > > > > Is it the free plan for open source projects? What is the
> >>>>>>>       > >> > > > > guaranteed build capacity of the current plan?
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > - If the current pricing plan (either free or paid) can't
> >>>>>>>       > >> > > > > provide stable build capacity, can we upgrade to a higher
> >>>>>>>       > >> > > > > priced plan with larger and more stable build capacity?
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > BTW, another factor that contributes to the productivity
> >>>>>>>       > >> > > > > problem is that our build is slow - we run a full build
> >>>>>>>       > >> > > > > for every PR and a successful full build takes ~5h. We
> >>>>>>>       > >> > > > > definitely have more options to solve it, for instance,
> >>>>>>>       > >> > > > > modularizing the build graph and reusing artifacts from
> >>>>>>>       > >> > > > > the previous build. But I think that can be a big effort
> >>>>>>>       > >> > > > > which is much harder to accomplish in a short period of
> >>>>>>>       > >> > > > > time and may deserve its own separate discussion.
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > > >
> >>>>>>>       > >> > > >
> >>>>>>>       > >> > >
> >>>>>>>       > >> >
> >>>>>>>       > >>
> >>>>>>>       > >
> >>>>>>>       >
> >>>>>>>
> >>>>>>>
> >>>>>>>       --
> >>>>>>>       Best Regards
> >>>>>>>
> >>>>>>>       Jeff Zhang
> >>>>>>>
> >>
>
>
