Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Bowen Li Wed, 03 Jul 2019 21:52:40 -0700

Re: > Are they using their own Travis CI pool, or did they switch to an
entirely different CI service?


I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis pool to their own in-house bare-metal
machines at [1], running a custom CI application at [2]. They've seen
significant improvements: much higher performance and basically no waiting
time for resources. A "night-and-day" difference, to quote Wes.

Re: > If we can just switch to our own Travis pool, just for our project,
then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]
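
As a rough sanity check on the capacity numbers quoted further down in this
thread (5800 build hours last month, ~2h per build, $249/month for 5
dedicated concurrency), here is a minimal back-of-the-envelope sketch. The
30-day month and the "one build occupies one executor" simplification are my
assumptions, not something INFRA or Travis stated; the function names are
made up for illustration.

```python
# Back-of-the-envelope CI capacity check using figures quoted in this thread.
# Assumptions (not from INFRA or Travis): a 30-day month, and one build
# occupying exactly one executor for its whole duration.

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption


def equivalent_servers(build_hours_per_month: float) -> float:
    """Machines running 24/7 needed to supply the given monthly build hours."""
    return build_hours_per_month / (HOURS_PER_DAY * DAYS_PER_MONTH)


def builds_per_day(executors: int, build_duration_hours: float) -> float:
    """Upper bound on builds per day if every executor is always busy."""
    return executors * HOURS_PER_DAY / build_duration_hours


def annual_cost(monthly_price_usd: int) -> int:
    """Yearly price of a plan billed monthly."""
    return monthly_price_usd * 12


if __name__ == "__main__":
    print(f"{equivalent_servers(5800):.1f} servers")  # 8.1 servers
    print(builds_per_day(5, 2.0))                     # 60.0
    print(annual_cost(249))                           # 2988
```

The first number matches INFRA's "EIGHT servers" figure. The second is only
an upper bound: a Flink build is split into several profiles, so the real
number of full PR builds per day on 5 executors would be lower.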


[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[email protected]> wrote:

> Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?
>
> If we can just switch to our own Travis pool, just for our project, then
> this might be something we can do fairly quickly?
>
> On 03/07/2019 05:55, Bowen Li wrote:
> > I responded in the INFRA ticket [1] that I believe they are using the
> > wrong metric against Flink: total build time is a completely different
> > thing from guaranteed build capacity.
> >
> > My response:
> >
> > "As mentioned above, I'm in Seattle, and since I started paying attention
> > to Flink's build queue a few weeks ago, I have seen no builds kicking off
> > for Flink during PST daytime on weekdays. Our teammates in China and Europe
> > have reported similar observations. So we need to work out where the
> > large total build time came from. If 1) your number and 2) our
> > observations from three locations that cover pretty much a full day are
> > all true, I **guess** one reason is that the extra build time most likely
> > came from weekends, when other Apache projects may be idle and
> > Flink drains its congested queue hard.
> >
> > Please be aware that we're not complaining about the lack of resources
> > in general; I'm complaining about the lack of **stable, dedicated**
> > resources. An example of the latter: currently, even if no build is
> > in Flink's queue and I submit a request that sits at the head of the queue
> > in the PST morning, my build won't even start for 6-8+ hours. That is an
> > absurd amount of waiting time.
> >
> > That is to say, if ASF INFRA decides to adopt a quota system and grants
> > Flink five DEDICATED servers that run all the time only for Flink, that'll
> > be PERFECT and would totally solve our problem now.
> >
> > I feel what's missing in ASF INFRA's Travis resource pool is some level
> > of build capacity SLA and certainty."
> >
> >
> > Again, I believe these two problems differ in nature:
> > long build time vs. lack of dedicated build resources. That is to say,
> > shortening the build time may or may not relieve the situation. I'm
> > slightly negative on disabling IT cases for PRs: the downside is that we
> > are at risk of bugs in PRs that UTs don't catch, which may cost a lot more
> > to fix later, especially if they slow down or even block others. But I am
> > open to others' opinions on it.
> >
> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't solve
> > our problem, since INFRA's pool is fully shared and they have no
> > control over, or finer insight into, resource allocation to a specific
> > Apache project. As mentioned in [1], Apache Arrow is moving away from the
> > ASF INFRA Travis pool (they are actually surprised Flink hasn't planned to
> > do so). I know that Spark is on its own build infra. If we all agree on
> > funding our own build infra, I'd be glad to help investigate potential
> > options after releasing 1.9, since I'm super busy with 1.9 now.
> >
> > [1] https://issues.apache.org/jira/browse/INFRA-18533
> >
> >
> >
> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[email protected]> wrote:
> >
> >> As a short-term stopgap, since we can assume this issue to become much
> >> worse in the following days/weeks, we could disable IT cases in PRs and
> >> only run them on master.
> >>
> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
> >>> People really have to stop thinking that just because something works
> >>> for us it is also a good solution.
> >>> Also, please remember that our builds run for 2h from start to finish,
> >>> and not the 14 _minutes_ it takes for zeppelin.
> >>> We are dealing with an entirely different scale here, both in terms of
> >>> build times and number of builds.
> >>>
> >>> In this very thread people have been complaining about long queue
> >>> times for their builds. Surprise, other Apache projects have been
> >>> suffering the very same thing due to us not controlling our build
> >>> times. While switching services (be it Jenkins, CircleCI or whatever)
> >>> will possibly work for us (and these options are actually attractive,
> >>> like CircleCI's proper support for build artifacts), it will also
> >>> result in us likely negatively affecting other projects in significant
> >>> ways.
> >>>
> >>> Sure, the Jenkins setup has a good user experience for us, at the cost
> >>> of blocking Jenkins workers for a _lot_ of time. Right now we have 25
> >>> PRs in our queue; that's possibly 50h we'd consume of Jenkins
> >>> resources, and the European contributors haven't even really started yet.
> >>>
> >>> FYI, the latest INFRA response from INFRA-18533:
> >>>
> >>> "Our rough metrics shows that Flink used over 5800 hours of build time
> >>> last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
> >>> MONTH. EIGHT. nonstop.
> >>> When we discovered this last night, we discussed it some and are going
> >>> to tune down Flink to allow only five executors maximum. We cannot
> >>> allow Flink to consume so much of a Foundation shared resource."
> >>>
> >>> So yes, we either
> >>> a) have to heavily reduce our CI usage or
> >>> b) fund our own, either maintaining it ourselves or donating to Apache.
> >>>
> >>> On 02/07/2019 05:11, Bowen Li wrote:
> >>>> By looking at the git history of the Jenkins script, its core part
> >>>> was finished in March 2017 (with only two minor updates in 2017/2018),
> >>>> so it's been running for over two years now and it feels like the
> >>>> Zeppelin community has been quite happy with it. @Jeff Zhang, can you
> >>>> share your insights and user experience with the Jenkins+Travis
> >>>> approach?
> >>>>
> >>>> Things like:
> >>>>
> >>>> - has the approach completely solved the resource capacity problem
> >>>> for the Zeppelin community? is the Zeppelin community happy with the
> >>>> result?
> >>>> - is the whole configuration chain stable (e.g. uptime) enough?
> >>>> - how often do you need to maintain the Jenkins infra? how many
> >>>> people are usually involved in maintenance and bug-fixes?
> >>>>
> >>>> To me, the downside of this approach seems to be mostly the maintenance
> >>>> burden: maintaining the script and the Jenkins infra.
> >>>>
> >>>> ** Having Our Own Travis-CI.com Account **
> >>>>
> >>>> Another alternative I've been thinking of is to have our own
> >>>> travis-ci.com account with paid dedicated resources. Note that
> >>>> travis-ci.org is the free version and travis-ci.com is the commercial
> >>>> version. We currently use a shared resource pool managed by the ASF
> >>>> INFRA team on travis-ci.org, but we have no control
> >>>> over it - we can't see how it's configured, how many resources are
> >>>> available, how resources are allocated among Apache projects, etc.
> >>>> The nice things about having an account on travis-ci.com are:
> >>>>
> >>>> - relatively low cost with much better resource guarantee than what
> >>>> we currently have [1]: $249/month with 5 dedicated concurrency,
> >>>> $489/month with 10 concurrency
> >>>> - low maintenance work compared to using Jenkins
> >>>> - (potentially) no migration cost according to Travis's doc [2]
> >>>> (pending verification)
> >>>> - full control over the build capacity/configuration compared to
> >>>> using ASF INFRA's pool
> >>>>
> >>>> I'd be surprised if we as such a vibrant community cannot find and
> >>>> fund $249*12=$2988 a year in exchange for a much better developer
> >>>> experience and much higher productivity.
> >>>>
> >>>> [1] https://travis-ci.com/plans
> >>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>>>
> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[email protected]> wrote:
> >>>>
> >>>>      So yes, the Jenkins job keeps polling the state from Travis until
> >>>>      it finishes.
> >>>>
> >>>>      Not sure I'm comfortable with the idea of using Jenkins workers
> >>>>      just to idle for several hours.
> >>>>
> >>>>      On 29/06/2019 14:56, Jeff Zhang wrote:
> >>>>      > Here's what the Zeppelin community did: we wrote a Python script to
> >>>>      > check the build status of a pull request.
> >>>>      > Here's the script:
> >>>>      > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >>>>      >
> >>>>      > And this is the script we use in the Jenkins build job.
> >>>>      >
> >>>>      > if [ -f "travis_check.py" ]; then
> >>>>      >    git log -n 1
> >>>>      >    STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
> >>>>      >      sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >>>>      >    AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >>>>      >    PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >>>>      >    #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >>>>      >    #if [ -z $COMMIT ]; then
> >>>>      >    #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>>      >    #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >>>>      >    #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> >>>>      >    #    grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>>      >    #fi
> >>>>      >
> >>>>      >    # get commit hash from PR
> >>>>      >    COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>>      >      grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >>>>      >      sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> >>>>      >      grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>>      >    sleep 30 # wait a moment for Travis to start the build
> >>>>      >    RET_CODE=0
> >>>>      >    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>>      >    # try with repository name when travis-ci is not available in the account
> >>>>      >    if [ $RET_CODE -eq 2 ]; then
> >>>>      >      RET_CODE=0
> >>>>      >      AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>>      >        grep '"full_name":' | grep -v "apache/zeppelin" |
> >>>>      >        sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >>>>      >      python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>>      >    fi
> >>>>      >
> >>>>      >    # fail when build information can't be found in Travis
> >>>>      >    if [ $RET_CODE -eq 2 ]; then
> >>>>      >      set +x
> >>>>      >      echo "-----------------------------------------------------"
> >>>>      >      echo "Looks like travis-ci is not configured for your fork."
> >>>>      >      echo "Please setup by switching on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >>>>      >      echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> >>>>      >      echo ""
> >>>>      >      echo "To trigger CI after setup, you will need to amend your last commit with"
> >>>>      >      echo "git commit --amend"
> >>>>      >      echo "git push your-remote HEAD --force"
> >>>>      >      echo ""
> >>>>      >      echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
> >>>>      >    fi
> >>>>      >
> >>>>      >    exit $RET_CODE
> >>>>      > else
> >>>>      >    set +x
> >>>>      >    echo "travis_check.py does not exist"
> >>>>      >    exit 1
> >>>>      > fi
> >>>>      >
> >>>>      > Chesnay Schepler <[email protected]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> >>>>      >
> >>>>      >> Does this imply that a Jenkins job is active as long as the
> >>>>      >> Travis build runs?
> >>>>      >>
> >>>>      >> On 26/06/2019 21:28, Bowen Li wrote:
> >>>>      >>> Hi,
> >>>>      >>>
> >>>>      >>> @Dawid, I think the "long test running" issue that I mentioned in the
> >>>>      >>> first email, as you also said, belongs to "a big effort which is much
> >>>>      >>> harder to accomplish in a short period of time and may deserve its own
> >>>>      >>> separate discussion". Thus I didn't include it in what we can do in the
> >>>>      >>> foreseeable short term.
> >>>>      >>>
> >>>>      >>> Besides, I don't think that's the ultimate reason for the lack of
> >>>>      >>> build resources. Even if the build is shortened to something like
> >>>>      >>> 2h, the problem I described, of no build machine working for 6 or more
> >>>>      >>> hours during PST daytime, will still happen, because no machine from
> >>>>      >>> ASF INFRA's pool is allocated to Flink. As I have paid close attention
> >>>>      >>> to the build queue over the past few weekdays, it's a pretty clear
> >>>>      >>> pattern now.
> >>>>      >>>
> >>>>      >>> **The ultimate root cause** for that is that we don't have any
> >>>>      >>> **dedicated** build resources that we can reliably depend on. I'm
> >>>>      >>> actually OK with waiting a long time if there are build requests
> >>>>      >>> running; it means at least we are making progress. But I'm not OK with
> >>>>      >>> having no build resources. A better place I think we should aim for in
> >>>>      >>> the short term is to always have at least a central pool (can be 3 or
> >>>>      >>> 5) of machines dedicated to building Flink at any time, or maybe use
> >>>>      >>> users' resources.
> >>>>      >>>
> >>>>      >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
> >>>>      >>> using a Jenkins job to automatically build on users' Travis accounts
> >>>>      >>> and link the result back to the GitHub PR. I guess the Jenkins job
> >>>>      >>> would fetch the latest upstream master and build the PR against it.
> >>>>      >>> Jeff has filed tickets to learn about and get access to the Jenkins
> >>>>      >>> infra. It'll be better to fully understand it first before judging
> >>>>      >>> this approach.
> >>>>      >>>
> >>>>      >>> I also heard good things about CircleCI, and ASF INFRA seems to have a
> >>>>      >>> pool of build capacity there too. It can be an alternative to consider.
> >>>>      >>>
> >>>>      >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[email protected]>
> >>>>      >>> wrote:
> >>>>      >>>
> >>>>      >>>> Sorry to jump in late, but I think Bowen missed the most important
> >>>>      >>>> point from Chesnay's previous message in the summary. The ultimate
> >>>>      >>>> reason for all the problems is that the tests already take close to
> >>>>      >>>> 2 hours to run.
> >>>>      >>>> I fully support this claim: "Unless people start caring about test
> >>>>      >>>> times before adding them, this issue cannot be solved"
> >>>>      >>>>
> >>>>      >>>> This is also another reason why using users' Travis accounts won't
> >>>>      >>>> help. Every few weeks we reach the user's time limit for a single
> >>>>      >>>> profile. This makes the user's builds simply fail, until we either
> >>>>      >>>> properly decrease the time the tests take (which I am not sure we ever
> >>>>      >>>> did) or postpone the problem by splitting into more profiles. (Note
> >>>>      >>>> that the ASF Travis account has higher time limits.)
> >>>>      >>>>
> >>>>      >>>> Best,
> >>>>      >>>>
> >>>>      >>>> Dawid
> >>>>      >>>>
> >>>>      >>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>>      >>>>> Do we know if using "the best" available hardware would improve the
> >>>>      >>>>> build times?
> >>>>      >>>>> Imagine we ran the build on machines with plenty of main memory to
> >>>>      >>>>> mount everything on a ramdisk, plus the latest CPU architecture?
> >>>>      >>>>>
> >>>>      >>>>> Throwing hardware at the problem could help reduce the time of an
> >>>>      >>>>> individual build, and using our own infrastructure would remove our
> >>>>      >>>>> dependency on Apache's Travis account (with the obvious downside of
> >>>>      >>>>> having to maintain the infrastructure).
> >>>>      >>>>> We could use an open source Travis alternative, to have a similar
> >>>>      >>>>> experience and make the migration easy.
> >>>>      >>>>>
> >>>>      >>>>>
> >>>>      >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[email protected]>
> >>>>      >>>>> wrote:
> >>>>      >>>>>> From what I gathered, there's no special sauce that the Zeppelin
> >>>>      >>>>>> project uses which actually integrates a user's Travis account into
> >>>>      >>>>>> the PR. They just disabled Travis for PRs. And that's kind of it.
> >>>>      >>>>>>
> >>>>      >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
> >>>>      >>>>>> resources, but there are downsides:
> >>>>      >>>>>>
> >>>>      >>>>>> The discoverability of the Travis check takes a nose-dive. Either we
> >>>>      >>>>>> require every contributor to always, on every commit, also post a
> >>>>      >>>>>> Travis build, or we have the reviewer sift through the contributor's
> >>>>      >>>>>> account to find it.
> >>>>      >>>>>>
> >>>>      >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to
> >>>>      >>>>>> having a PR build.
> >>>>      >>>>>>
> >>>>      >>>>>> A normal branch build takes a branch as is and tests it. A PR build
> >>>>      >>>>>> merges the branch into master, and then runs it. (Fun fact: This is
> >>>>      >>>>>> why a PR with merge conflicts is not being run on Travis.)
> >>>>      >>>>>>
> >>>>      >>>>>> And ultimately, everyone can already make use of this approach anyway.
> >>>>      >>>>>>
> >>>>      >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>>      >>>>>>> Hi Jeff,
> >>>>      >>>>>>>
> >>>>      >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>>>      >>>>>>> leverage users' Travis accounts.
> >>>>      >>>>>>> In this way, we can have almost unlimited concurrent build jobs, and
> >>>>      >>>>>>> developers can restart builds by themselves (currently only
> >>>>      >>>>>>> committers can restart a PR's build).
> >>>>      >>>>>>>
> >>>>      >>>>>>> But I'm still not very clear on how to integrate a user's Travis
> >>>>      >>>>>>> build into the Flink pull request's build automatically. Can you
> >>>>      >>>>>>> explain in more detail?
> >>>>      >>>>>>>
> >>>>      >>>>>>> Another question: does Travis only build branches for user accounts?
> >>>>      >>>>>>> My concern is that builds for PRs rebase the user's commits against
> >>>>      >>>>>>> the current master branch, which helps us find problems before
> >>>>      >>>>>>> merging. Builds for branches will lose the impact of new commits in
> >>>>      >>>>>>> master.
> >>>>      >>>>>>> How does Zeppelin solve this problem?
> >>>>      >>>>>>>
> >>>>      >>>>>>> Thanks again for sharing the idea.
> >>>>      >>>>>>>
> >>>>      >>>>>>> Regards,
> >>>>      >>>>>>> Jark
> >>>>      >>>>>>>
> >>>>      >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[email protected]> wrote:
> >>>>      >>>>>>>
> >>>>      >>>>>>>       Hi Folks,
> >>>>      >>>>>>>
> >>>>      >>>>>>>       Zeppelin ran into this kind of issue before; we solved it by
> >>>>      >>>>>>>       delegating each contributor's PR build to their own Travis
> >>>>      >>>>>>>       account (everyone can have 5 free slots for Travis builds).
> >>>>      >>>>>>>       The Apache account's Travis build is only triggered when a PR
> >>>>      >>>>>>>       is merged.
> >>>>      >>>>>>>
> >>>>      >>>>>>>
> >>>>      >>>>>>>
> >>>>      >>>>>>>       Kurt Young <[email protected]> wrote on Tue, Jun 25, 2019
> >>>>      >>>>>>>       at 10:16 AM:
> >>>>      >>>>>>>
> >>>>      >>>>>>>       > (Forgot to cc George)
> >>>>      >>>>>>>       >
> >>>>      >>>>>>>       > Best,
> >>>>      >>>>>>>       > Kurt
> >>>>      >>>>>>>       >
> >>>>      >>>>>>>       >
> >>>>      >>>>>>>       > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[email protected]> wrote:
> >>>>      >>>>>>>       >
> >>>>      >>>>>>>       > > Hi Bowen,
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       > > Thanks for bringing this up. We have actually discussed this,
> >>>>      >>>>>>>       > > and I think Till and George have already spent some time
> >>>>      >>>>>>>       > > investigating it. I have cc'ed both of them, and maybe they
> >>>>      >>>>>>>       > > can share their findings.
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       > > Best,
> >>>>      >>>>>>>       > > Kurt
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[email protected]> wrote:
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       > >> Hi Bowen,
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> Thanks for bringing this up. We have also suffered from the
> >>>>      >>>>>>>       > >> long build time.
> >>>>      >>>>>>>       > >> I agree that we should focus on solving the build capacity
> >>>>      >>>>>>>       > >> problem in this thread.
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> My observation is that only one build is running while all
> >>>>      >>>>>>>       > >> the others (other PRs, master) are pending.
> >>>>      >>>>>>>       > >> The pricing plan [1] of Travis shows it can support
> >>>>      >>>>>>>       > >> concurrent build jobs.
> >>>>      >>>>>>>       > >> But I don't know which plan we are using; it might be the
> >>>>      >>>>>>>       > >> free plan for open source.
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> I cc-ed Chesnay who may have some experience
> on
> >>>>      Travis.
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> Regards,
> >>>>      >>>>>>>       > >> Jark
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> [1]: https://travis-ci.com/plans
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[email protected]> wrote:
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >> > Hi Steven,
> >>>>      >>>>>>>       > >> >
> >>>>      >>>>>>>       > >> > I think you may not have read what I wrote. The discussion
> >>>>      >>>>>>>       > >> > is about "unstable build **capacity**", in other words
> >>>>      >>>>>>>       > >> > "unstable / lack of build resources", not "unstable build".
> >>>>      >>>>>>>       > >> >
> >>>>      >>>>>>>       > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[email protected]> wrote:
> >>>>      >>>>>>>       > >> >
> >>>>      >>>>>>>       > >> > > long and sometimes unstable builds are definitely a pain
> >>>>      >>>>>>>       > >> > > point.
> >>>>      >>>>>>>       > >> > >
> >>>>      >>>>>>>       > >> > > I suspect the build failure here in flink-connector-kafka
> >>>>      >>>>>>>       > >> > > is not related to my change, but there is no easy way to
> >>>>      >>>>>>>       > >> > > re-run the build in the Travis UI. A Google search showed
> >>>>      >>>>>>>       > >> > > a trick: closing and reopening the PR will trigger a
> >>>>      >>>>>>>       > >> > > rebuild, but that could add noise to the PR activity.
> >>>>      >>>>>>>       > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>      >>>>>>>       > >> > >
> >>>>      >>>>>>>       > >> > > travis-ci for my personal repo often failed by exceeding
> >>>>      >>>>>>>       > >> > > the time limit after 4+ hours:
> >>>>      >>>>>>>       > >> > > "The job exceeded the maximum time limit for jobs, and
> >>>>      >>>>>>>       > >> > > has been terminated."
> >>>>      >>>>>>>       > >> > >
> >>>>      >>>>>>>       > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[email protected]> wrote:
> >>>>      >>>>>>>       > >> > >
> >>>>      >>>>>>>       > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>>      >>>>>>>       > >> > > > This build request has been sitting at the **HEAD of the
> >>>>      >>>>>>>       > >> > > > queue** since I first saw it at PST 10:30am (not sure
> >>>>      >>>>>>>       > >> > > > how long it had been there before 10:30am). It's PST
> >>>>      >>>>>>>       > >> > > > 4:12pm now and it hasn't started yet.
> >>>>      >>>>>>>       > >> > > >
> >>>>      >>>>>>>       > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[email protected]> wrote:
> >>>>      >>>>>>>       > >> > > >
> >>>>      >>>>>>>       > >> > > > > Hi devs,
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > I've been experiencing the pain resulting from a lack
> >>>>      >>>>>>>       > >> > > > > of stable build capacity on Travis for Flink PRs [1].
> >>>>      >>>>>>>       > >> > > > > Specifically, I have often noticed that no build in
> >>>>      >>>>>>>       > >> > > > > the queue makes any progress for hours, and then
> >>>>      >>>>>>>       > >> > > > > suddenly 5 or 6 builds kick off all together after the
> >>>>      >>>>>>>       > >> > > > > long pause. I'm in the PST (UTC-08) time zone, and
> >>>>      >>>>>>>       > >> > > > > I've seen pauses as long as 6 hours, from PST 9am to
> >>>>      >>>>>>>       > >> > > > > 3pm (let alone the time needed to drain the queue
> >>>>      >>>>>>>       > >> > > > > afterwards).
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > I think this has greatly impacted our productivity.
> >>>>      >>>>>>>       > >> > > > > I've experienced that PRs submitted in the early
> >>>>      >>>>>>>       > >> > > > > morning of the PST time zone won't finish their build
> >>>>      >>>>>>>       > >> > > > > until late night of the same day.
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > So my questions are:
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > - Has anyone else experienced the same problem or made
> >>>>      >>>>>>>       > >> > > > > similar observations on TravisCI? (I suspect it has
> >>>>      >>>>>>>       > >> > > > > something to do with time zones)
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > - What pricing plan of TravisCI is Flink currently
> >>>>      >>>>>>>       > >> > > > > using? Is it the free plan for open source projects?
> >>>>      >>>>>>>       > >> > > > > What is the guaranteed build capacity of the current
> >>>>      >>>>>>>       > >> > > > > plan?
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > - If the current pricing plan (either free or paid)
> >>>>      >>>>>>>       > >> > > > > can't provide stable build capacity, can we upgrade to
> >>>>      >>>>>>>       > >> > > > > a higher-priced plan with larger and more stable build
> >>>>      >>>>>>>       > >> > > > > capacity?
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > BTW, another factor that contributes to the
> >>>>      >>>>>>>       > >> > > > > productivity problem is that our build is slow - we
> >>>>      >>>>>>>       > >> > > > > run a full build for every PR and a successful full
> >>>>      >>>>>>>       > >> > > > > build takes ~5h. We definitely have more options to
> >>>>      >>>>>>>       > >> > > > > solve it, for instance, modularizing the build graph
> >>>>      >>>>>>>       > >> > > > > and reusing artifacts from the previous build. But I
> >>>>      >>>>>>>       > >> > > > > think that can be a big effort which is much harder to
> >>>>      >>>>>>>       > >> > > > > accomplish in a short period of time and may deserve
> >>>>      >>>>>>>       > >> > > > > its own separate discussion.
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > > >
> >>>>      >>>>>>>       > >> > > >
> >>>>      >>>>>>>       > >> > >
> >>>>      >>>>>>>       > >> >
> >>>>      >>>>>>>       > >>
> >>>>      >>>>>>>       > >
> >>>>      >>>>>>>       >
> >>>>      >>>>>>>
> >>>>      >>>>>>>
> >>>>      >>>>>>>       --
> >>>>      >>>>>>>       Best Regards
> >>>>      >>>>>>>
> >>>>      >>>>>>>       Jeff Zhang
> >>>>      >>>>>>>
> >>>>      >>
> >>>>
> >>>
> >>
>
>
