> Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?

I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house metal machines
at [1], running a custom CI application at [2]. They've seen a significant
improvement in both respects: much higher performance and basically no
resource waiting time. A "night-and-day" difference, quoting Wes.

> If we can just switch to our own Travis pool, just for our project, then
> this might be something we can do fairly quickly?

I believe so, according to [3] and [4].

[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com

On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <ches...@apache.org> wrote:

> Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?
>
> If we can just switch to our own Travis pool, just for our project, then
> this might be something we can do fairly quickly?
>
> On 03/07/2019 05:55, Bowen Li wrote:
> > I responded in the INFRA ticket [1] that I believe they are using the
> > wrong metric against Flink, and that total build time is a completely
> > different thing from guaranteed build capacity.
> >
> > My response:
> >
> > "As mentioned above, I started paying attention to Flink's build queue
> > a few tens of days ago. I'm in Seattle, and I saw no builds kicking off
> > for Flink during PST daytime on weekdays. Our teammates in China and
> > Europe have reported similar observations. So we need to evaluate where
> > the large total build time came from. If 1) your number and 2) our
> > observations from three locations that cover pretty much a full day are
> > both true, I **guess** one reason may be that the extra build time most
> > likely came from weekends, when other Apache projects may be idle and
> > Flink just drains its congested queue hard.
> >
> > Please be aware that we're not complaining about the lack of resources
> > in general; I'm complaining about the lack of **stable, dedicated**
> > resources. An example of the latter: currently, even if no build is in
> > Flink's queue and I submit a request that sits at the head of the queue
> > in the PST morning, my build won't start for 6-8+ hours. That is an
> > absurd amount of waiting time.
> >
> > That is to say, if ASF INFRA decides to adopt a quota system and grants
> > Flink five DEDICATED servers that run all the time only for Flink,
> > that'll be PERFECT and can totally solve our problem now.
> >
> > I feel what's missing in ASF INFRA's Travis resource pool is some level
> > of build capacity SLAs and certainty"
> >
> >
> > Again, I believe these two problems differ in nature: long build time
> > vs. lack of dedicated build resources. That is to say, shortening build
> > time may relieve the situation, or it may not. I'm slightly negative on
> > disabling IT cases for PRs: the downside is that we risk bugs in PRs
> > that UTs don't catch, which may cost a lot more to fix and may slow
> > down or even block others. But I'm open to others' opinions on it.
> >
> > AFAICT from INFRA ticket [1], donating to ASF INFRA won't be a feasible
> > way to solve our problem, since INFRA's pool is fully shared and they
> > have no control over, or finer insight into, resource allocation to a
> > specific Apache project. As mentioned in [1], Apache Arrow is moving
> > away from the ASF INFRA Travis pool (they are actually surprised Flink
> > hasn't planned to do so). I know that Spark is on its own build infra.
> > If we all agree on funding our own build infra, I'd be glad to help
> > investigate potential options after releasing 1.9, since I'm super busy
> > with 1.9 now.
> >
> > [1] https://issues.apache.org/jira/browse/INFRA-18533
> >
> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <ches...@apache.org> wrote:
> >
> >> As a short-term stopgap, since we can assume this issue will become
> >> much worse in the following days/weeks, we could disable IT cases in
> >> PRs and only run them on master.
> >>
> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
> >>> People really have to stop thinking that just because something works
> >>> for us it is also a good solution.
> >>> Also, please remember that our builds run for 2h from start to finish,
> >>> and not the 14 _minutes_ it takes for zeppelin.
> >>> We are dealing with an entirely different scale here, both in terms of
> >>> build times and number of builds.
> >>>
> >>> In this very thread people have been complaining about long queue
> >>> times for their builds. Surprise, other Apache projects have been
> >>> suffering the very same thing due to us not controlling our build
> >>> times. While switching services (be it Jenkins, CircleCI or whatever)
> >>> will possibly work for us (and these options are actually attractive,
> >>> like CircleCI's proper support for build artifacts), it will also
> >>> likely result in us negatively affecting other projects in significant
> >>> ways.
> >>>
> >>> Sure, the Jenkins setup has a good user experience for us, at the cost
> >>> of blocking Jenkins workers for a _lot_ of time. Right now we have 25
> >>> PRs in our queue; that's possibly 50h we'd consume of Jenkins
> >>> resources, and the European contributors haven't even really started yet.
> >>>
> >>> FYI, the latest INFRA response from INFRA-18533:
> >>>
> >>> "Our rough metrics shows that Flink used over 5800 hours of build time
> >>> last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
> >>> MONTH. EIGHT. nonstop.
> >>> When we discovered this last night, we discussed it some and are going
> >>> to tune down Flink to allow only five executors maximum. We cannot
> >>> allow Flink to consume so much of a Foundation shared resource."
> >>>
> >>> So yes, we either
> >>> a) have to heavily reduce our CI usage or
> >>> b) fund our own, either maintaining it ourselves or donating to Apache.
> >>>
> >>> On 02/07/2019 05:11, Bowen Li wrote:
> >>>> Looking at the git history of the Jenkins script, its core part
> >>>> was finished in March 2017 (with only two minor updates in 2017/2018),
> >>>> so it's been running for over two years now, and it feels like the
> >>>> Zeppelin community has been quite happy with it. @Jeff Zhang, can you
> >>>> share your insights and user experience with the Jenkins+Travis
> >>>> approach?
> >>>>
> >>>> Things like:
> >>>>
> >>>> - Has the approach completely solved the resource capacity problem
> >>>> for the Zeppelin community? Is the Zeppelin community happy with the
> >>>> result?
> >>>> - Is the whole configuration chain stable enough (e.g. uptime)?
> >>>> - How often do you need to maintain the Jenkins infra? How many
> >>>> people are usually involved in maintenance and bug fixes?
> >>>>
> >>>> To me, the downside of this approach seems to be mostly maintenance:
> >>>> maintaining the script and the Jenkins infra.
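
As a quick sanity check of the INFRA-18533 numbers quoted above, here is a minimal Python sketch. It assumes a 30-day month and servers running 24/7; the function names are made up for illustration and are not part of any INFRA tooling.

```python
HOURS_PER_DAY = 24


def equivalent_servers(build_hours, days_in_month=30):
    """Servers running 24/7 needed to supply build_hours in one month.

    Assumption: a 30-day month unless stated otherwise.
    """
    return build_hours / (HOURS_PER_DAY * days_in_month)


def monthly_capacity_hours(executors, days_in_month=30):
    """Upper bound on build hours that `executors` can deliver in one month."""
    return executors * HOURS_PER_DAY * days_in_month


if __name__ == "__main__":
    reported_hours = 5800  # build hours quoted from INFRA-18533
    print(round(equivalent_servers(reported_hours), 2))  # 8.06
    print(monthly_capacity_hours(5))                     # 3600
```

Under these assumptions, 5800 hours is indeed about eight servers running nonstop, and a cap of five executors can deliver at most 3600 build hours in a month, roughly 62% of last month's usage.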
> >>>>
> >>>> ** Having Our Own Travis-CI.com Account **
> >>>>
> >>>> Another alternative I've been thinking of is to have our own
> >>>> travis-ci.com account with paid dedicated resources. Note that
> >>>> travis-ci.org is the free version and travis-ci.com is the commercial
> >>>> version. We currently use a shared resource pool managed by the ASF
> >>>> INFRA team on travis-ci.org, but we have no control over it: we can't
> >>>> see how it's configured, how many resources are available, how
> >>>> resources are allocated among Apache projects, etc.
> >>>> The nice things about having an account on travis-ci.com are:
> >>>>
> >>>> - relatively low cost with a much better resource guarantee than what
> >>>> we currently have [1]: $249/month with 5 dedicated concurrent jobs,
> >>>> $489/month with 10 concurrent jobs
> >>>> - low maintenance work compared to using Jenkins
> >>>> - (potentially) no migration cost according to Travis's doc [2]
> >>>> (pending verification)
> >>>> - full control over the build capacity/configuration compared to
> >>>> using ASF INFRA's pool
> >>>>
> >>>> I'd be surprised if we, as such a vibrant community, cannot find and
> >>>> fund $249*12=$2988 a year in exchange for a much better developer
> >>>> experience and much higher productivity.
> >>>>
> >>>> [1] https://travis-ci.com/plans
> >>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>>>
> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>>>
> >>>> So yes, the Jenkins job keeps pulling the state from Travis until it
> >>>> finishes.
> >>>>
> >>>> Not sure I'm comfortable with the idea of using Jenkins workers just
> >>>> to idle for several hours.
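
To put the travis-ci.com plan prices quoted above next to the queue sizes mentioned in this thread, here is a rough Python sketch. It assumes that every build occupies exactly one concurrent job slot for its whole duration and that all builds take the same time; a Flink build that fans out into several Travis jobs would need proportionally longer. The helper names are hypothetical.

```python
import math


def yearly_cost(monthly_price):
    """Yearly cost in dollars for a plan billed monthly."""
    return monthly_price * 12


def drain_hours(queued_builds, hours_per_build, concurrency):
    """Idealized time to drain a queue.

    Assumption: builds run in waves of `concurrency`, one slot per build,
    and each wave takes `hours_per_build`.
    """
    return math.ceil(queued_builds / concurrency) * hours_per_build


if __name__ == "__main__":
    for price, slots in ((249, 5), (489, 10)):
        print(price, slots, yearly_cost(price), drain_hours(25, 2, slots))
```

With the 25 queued PRs and 2h builds mentioned earlier in the thread, the 5-slot plan would need about 10 hours to drain the queue and the 10-slot plan about 6 hours, under the assumptions above.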
> >>>>
> >>>> On 29/06/2019 14:56, Jeff Zhang wrote:
> >>>> > Here's what the Zeppelin community did: we made a Python script to
> >>>> > check the build status of a pull request.
> >>>> > Here's the script:
> >>>> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >>>> >
> >>>> > And this is the script we use in the Jenkins build job.
> >>>> >
> >>>> > if [ -f "travis_check.py" ]; then
> >>>> >   git log -n 1
> >>>> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
> >>>> >     sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >>>> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >>>> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >>>> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >>>> >   #if [ -z $COMMIT ]; then
> >>>> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>> >   #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >>>> >   #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
> >>>> >   #    sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>> >   #fi
> >>>> >
> >>>> >   # get commit hash from PR
> >>>> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>> >     grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> >>>> >     sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
> >>>> >     sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>> >   sleep 30 # wait a moment for travis to start the build
> >>>> >   RET_CODE=0
> >>>> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> >>>> >     RET_CODE=0
> >>>> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> >>>> >       grep '"full_name":' | grep -v "apache/zeppelin" |
> >>>> >       sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >>>> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>> >   fi
> >>>> >
> >>>> >   if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in travis
> >>>> >     set +x
> >>>> >     echo "-----------------------------------------------------"
> >>>> >     echo "Looks like travis-ci is not configured for your fork."
> >>>> >     echo "Please setup by switching on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >>>> >     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> >>>> >     echo ""
> >>>> >     echo "To trigger CI after setup, you will need to amend your last commit with"
> >>>> >     echo "git commit --amend"
> >>>> >     echo "git push your-remote HEAD --force"
> >>>> >     echo ""
> >>>> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
> >>>> >   fi
> >>>> >
> >>>> >   exit $RET_CODE
> >>>> > else
> >>>> >   set +x
> >>>> >   echo "travis_check.py does not exist"
> >>>> >   exit 1
> >>>> > fi
> >>>> >
> >>>> > On Sat, Jun 29, 2019 at 3:17 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>>> >
> >>>> >> Does this imply that a Jenkins job is active as long as the Travis
> >>>> >> build runs?
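
To illustrate why the Jenkins worker stays occupied for the whole Travis build, here is a minimal Python sketch of a polling loop in the spirit of the job above. It is not the actual travis_check.py: the status strings and the `get_status` callable are hypothetical, and only the meaning of exit code 2 (no build information found) is taken from the Jenkins script quoted above.

```python
import time

# Hypothetical status strings; the real travis_check.py may use different ones.
# Exit code 2 mirrors the "can't find build information" case in the Jenkins job.
TERMINAL_EXIT_CODES = {"passed": 0, "failed": 1, "not_found": 2}


def wait_for_build(get_status, poll_interval_s=300, timeout_s=4 * 3600,
                   sleep=time.sleep):
    """Poll get_status() until the build finishes or the timeout elapses.

    Returns (exit_code, seconds_waited). The calling CI worker is blocked
    for seconds_waited, which is the cost raised in this thread.
    """
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_EXIT_CODES:
            return TERMINAL_EXIT_CODES[status], waited
        if waited >= timeout_s:
            return 1, waited  # treat a timeout as a failure
        sleep(poll_interval_s)
        waited += poll_interval_s


if __name__ == "__main__":
    # Simulated 2h build polled every 5 minutes; no real sleeping or network.
    statuses = iter(["running"] * 24 + ["passed"])
    code, waited = wait_for_build(lambda: next(statuses), sleep=lambda s: None)
    print(code, waited / 3600)
```

In this simulation the worker is held for the full two hours even though it does nothing but wait, which is the concern behind the question above.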
> >>>> >>
> >>>> >> On 26/06/2019 21:28, Bowen Li wrote:
> >>>> >>> Hi,
> >>>> >>>
> >>>> >>> @Dawid, I think the "long test running" problem that I mentioned in
> >>>> >>> the first email, and that you also raised, belongs to "a big effort
> >>>> >>> which is much harder to accomplish in a short period of time and may
> >>>> >>> deserve its own separate discussion". Thus I didn't include it in
> >>>> >>> what we can do in the foreseeable short term.
> >>>> >>>
> >>>> >>> Besides, I don't think that's the ultimate reason for the lack of
> >>>> >>> build resources. Even if the build is shortened to something like
> >>>> >>> 2h, the problem I described, of no build machine working for 6 or
> >>>> >>> more hours during PST daytime, will still happen, because no machine
> >>>> >>> from ASF INFRA's pool is allocated to Flink. I have paid close
> >>>> >>> attention to the build queue over the past few weekdays, and it's a
> >>>> >>> pretty clear pattern now.
> >>>> >>>
> >>>> >>> **The ultimate root cause** for that is that we don't have any
> >>>> >>> **dedicated** build resources that we can stably rely on. I'm
> >>>> >>> actually OK with waiting for a long time if there are build requests
> >>>> >>> running; it means at least we are making progress. But I'm not OK
> >>>> >>> with no build resources. A better place I think we should aim for in
> >>>> >>> the short term is to always have at least a central pool (can be 3
> >>>> >>> or 5) of machines dedicated to building Flink at any time, or maybe
> >>>> >>> to use users' resources.
> >>>> >>>
> >>>> >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community
> >>>> >>> is using a Jenkins job to automatically build on users' Travis
> >>>> >>> accounts and link the result back to the GitHub PR. I guess the
> >>>> >>> Jenkins job would fetch the latest upstream master and build the PR
> >>>> >>> against it.
> >>>> >>> Jeff has filed tickets to learn about and get access to the Jenkins
> >>>> >>> infra. It'd be better to fully understand it first before judging
> >>>> >>> this approach.
> >>>> >>>
> >>>> >>> I also heard good things about CircleCI, and ASF INFRA seems to have
> >>>> >>> a pool of build capacity there too. It can be an alternative to
> >>>> >>> consider.
> >>>> >>>
> >>>> >>>
> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>> >>>
> >>>> >>>> Sorry to jump in late, but I think Bowen missed the most important
> >>>> >>>> point from Chesnay's previous message in the summary. The ultimate
> >>>> >>>> reason for all the problems is that the tests take close to 2 hours
> >>>> >>>> to run already.
> >>>> >>>> I fully support this claim: "Unless people start caring about test
> >>>> >>>> times before adding them, this issue cannot be solved"
> >>>> >>>>
> >>>> >>>> This is also another reason why using users' Travis accounts won't
> >>>> >>>> help. Every few weeks we reach the user's time limit for a single
> >>>> >>>> profile. This makes the user's builds simply fail, until we either
> >>>> >>>> properly decrease the time the tests take (which I am not sure we
> >>>> >>>> ever did) or postpone the problem by splitting into more profiles.
> >>>> >>>> (Note that the ASF Travis account has higher time limits)
> >>>> >>>>
> >>>> >>>> Best,
> >>>> >>>>
> >>>> >>>> Dawid
> >>>> >>>>
> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>> >>>>> Do we know if using "the best" available hardware would improve
> >>>> >>>>> the build times?
> >>>> >>>>> Imagine we ran the build on machines with plenty of main memory
> >>>> >>>>> to mount everything on a ramdisk, plus the latest CPU architecture.
> >>>> >>>>>
> >>>> >>>>> Throwing hardware at the problem could help reduce the time of an
> >>>> >>>>> individual build, and using our own infrastructure would remove our
> >>>> >>>>> dependency on Apache's Travis account (with the obvious downside of
> >>>> >>>>> having to maintain the infrastructure).
> >>>> >>>>> We could use an open source Travis alternative to have a similar
> >>>> >>>>> experience and make the migration easy.
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>>> >>>>>> From what I gathered, there's no special sauce that the Zeppelin
> >>>> >>>>>> project uses which actually integrates a user's Travis account
> >>>> >>>>>> into the PR.
> >>>> >>>>>> They just disabled Travis for PRs. And that's kind of it.
> >>>> >>>>>>
> >>>> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
> >>>> >>>>>> resources, but there are downsides:
> >>>> >>>>>>
> >>>> >>>>>> The discoverability of the Travis check takes a nose-dive. Either
> >>>> >>>>>> we require every contributor to always, on every commit, also post
> >>>> >>>>>> a Travis build, or we have the reviewer sift through the
> >>>> >>>>>> contributor's account to find it.
> >>>> >>>>>>
> >>>> >>>>>> This is rather cumbersome. Additionally, it's also not equivalent
> >>>> >>>>>> to having a PR build.
> >>>> >>>>>>
> >>>> >>>>>> A normal branch build takes a branch as is and tests it. A PR
> >>>> >>>>>> build merges the branch into master, and then runs it. (Fun fact:
> >>>> >>>>>> This is why a PR with merge conflicts is not run on Travis.)
> >>>> >>>>>>
> >>>> >>>>>> And ultimately, everyone can already make use of this approach
> >>>> >>>>>> anyway.
> >>>> >>>>>>
> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>> >>>>>>> Hi Jeff,
> >>>> >>>>>>>
> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good
> >>>> >>>>>>> idea to leverage users' Travis accounts.
> >>>> >>>>>>> In this way, we can have almost unlimited concurrent build jobs,
> >>>> >>>>>>> and developers can restart builds by themselves (currently only
> >>>> >>>>>>> committers can restart a PR's build).
> >>>> >>>>>>>
> >>>> >>>>>>> But I'm still not very clear on how to integrate a user's Travis
> >>>> >>>>>>> build into the Flink pull request's build automatically. Can you
> >>>> >>>>>>> explain in more detail?
> >>>> >>>>>>>
> >>>> >>>>>>> Another question: does Travis only build branches for user
> >>>> >>>>>>> accounts?
> >>>> >>>>>>> My concern is that builds for PRs rebase the user's commits
> >>>> >>>>>>> against the current master branch.
> >>>> >>>>>>> This helps us find problems before merge. Builds for branches
> >>>> >>>>>>> will lose the impact of new commits in master.
> >>>> >>>>>>> How does Zeppelin solve this problem?
> >>>> >>>>>>>
> >>>> >>>>>>> Thanks again for sharing the idea.
> >>>> >>>>>>>
> >>>> >>>>>>> Regards,
> >>>> >>>>>>> Jark
> >>>> >>>>>>>
> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>> Hi Folks,
> >>>> >>>>>>>
> >>>> >>>>>>> Zeppelin met this kind of issue before. We solved it by
> >>>> >>>>>>> delegating each contributor's PR build to their own Travis
> >>>> >>>>>>> account (everyone can have 5 free slots for Travis builds).
> >>>> >>>>>>> The Apache account's Travis build is only triggered when a PR is
> >>>> >>>>>>> merged.
> >>>> >>>>>>>
> >>>> >>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>> > (Forgot to cc George)
> >>>> >>>>>>> >
> >>>> >>>>>>> > Best,
> >>>> >>>>>>> > Kurt
> >>>> >>>>>>> >
> >>>> >>>>>>> >
> >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
> >>>> >>>>>>> >
> >>>> >>>>>>> > > Hi Bowen,
> >>>> >>>>>>> > >
> >>>> >>>>>>> > > Thanks for bringing this up. We have actually discussed this,
> >>>> >>>>>>> > > and I think Till and George have already spent some time
> >>>> >>>>>>> > > investigating it. I have cc'ed both of them, and maybe they
> >>>> >>>>>>> > > can share their findings.
> >>>> >>>>>>> > >
> >>>> >>>>>>> > > Best,
> >>>> >>>>>>> > > Kurt
> >>>> >>>>>>> > >
> >>>> >>>>>>> > >
> >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> >>>> >>>>>>> > >
> >>>> >>>>>>> > >> Hi Bowen,
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> Thanks for bringing this up. We have also suffered from the
> >>>> >>>>>>> > >> long build time.
> >>>> >>>>>>> > >> I agree that we should focus on solving the build capacity
> >>>> >>>>>>> > >> problem in this thread.
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> My observation is that only one build is running, and all
> >>>> >>>>>>> > >> the others (other PRs, master) are pending.
> >>>> >>>>>>> > >> The pricing plan [1] of Travis shows it can support
> >>>> >>>>>>> > >> concurrent build jobs.
> >>>> >>>>>>> > >> But I don't know which plan we are using; it might be the
> >>>> >>>>>>> > >> free plan for open source.
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> I cc'ed Chesnay, who may have some experience with Travis.
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> Regards,
> >>>> >>>>>>> > >> Jark
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
> >>>> >>>>>>> > >>
> >>>> >>>>>>> > >> > Hi Steven,
> >>>> >>>>>>> > >> >
> >>>> >>>>>>> > >> > I think you may not have read what I wrote. The discussion
> >>>> >>>>>>> > >> > is about "unstable build **capacity**", in other words
> >>>> >>>>>>> > >> > "unstable / lack of build resources", not "unstable build".
> >>>> >>>>>>> > >> >
> >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
> >>>> >>>>>>> > >> >
> >>>> >>>>>>> > >> > > A long and sometimes unstable build is definitely a pain
> >>>> >>>>>>> > >> > > point.
> >>>> >>>>>>> > >> > >
> >>>> >>>>>>> > >> > > I suspect the build failure here in flink-connector-kafka
> >>>> >>>>>>> > >> > > is not related to my change, but there is no easy way to
> >>>> >>>>>>> > >> > > re-run the build in the Travis UI. A Google search showed
> >>>> >>>>>>> > >> > > a trick: closing and reopening the PR will trigger a
> >>>> >>>>>>> > >> > > rebuild. But that could add noise to the PR activity.
> >>>> >>>>>>> > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>> >>>>>>> > >> > >
> >>>> >>>>>>> > >> > > travis-ci for my personal repo often failed by exceeding
> >>>> >>>>>>> > >> > > the time limit after 4+ hours:
> >>>> >>>>>>> > >> > > "The job exceeded the maximum time limit for jobs, and has
> >>>> >>>>>>> > >> > > been terminated."
> >>>> >>>>>>> > >> > >
> >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>> >>>>>>> > >> > >
> >>>> >>>>>>> > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>> >>>>>>> > >> > > > This build request has been sitting at the **HEAD of the
> >>>> >>>>>>> > >> > > > queue** since I first saw it at 10:30am PST (not sure
> >>>> >>>>>>> > >> > > > how long it had been there before 10:30am). It's 4:12pm
> >>>> >>>>>>> > >> > > > PST now and it hasn't started yet.
> >>>> >>>>>>> > >> > > >
> >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>> >>>>>>> > >> > > >
> >>>> >>>>>>> > >> > > > > Hi devs,
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > I've been experiencing the pain resulting from the
> >>>> >>>>>>> > >> > > > > lack of stable build capacity on Travis for Flink PRs
> >>>> >>>>>>> > >> > > > > [1].
> >>>> >>>>>>> > >> > > > > Specifically, I have often noticed that no build in
> >>>> >>>>>>> > >> > > > > the queue makes any progress for hours, and then
> >>>> >>>>>>> > >> > > > > suddenly 5 or 6 builds kick off all together after the
> >>>> >>>>>>> > >> > > > > long pause. I'm in the PST (UTC-08) time zone, and
> >>>> >>>>>>> > >> > > > > I've seen the pause be as long as 6 hours, from 9am to
> >>>> >>>>>>> > >> > > > > 3pm PST (let alone the time needed to drain the queue
> >>>> >>>>>>> > >> > > > > afterwards).
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > I think this has greatly impacted our productivity.
> >>>> >>>>>>> > >> > > > > I've experienced that PRs submitted in the early
> >>>> >>>>>>> > >> > > > > morning PST won't finish their build until late at
> >>>> >>>>>>> > >> > > > > night the same day.
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > So my questions are:
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > - Has anyone else experienced the same problem or made
> >>>> >>>>>>> > >> > > > > similar observations on TravisCI? (I suspect it has
> >>>> >>>>>>> > >> > > > > something to do with time zones)
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > - What pricing plan of TravisCI is Flink currently
> >>>> >>>>>>> > >> > > > > using? Is it the free plan for open source projects?
> >>>> >>>>>>> > >> > > > > What is the guaranteed build capacity of the current
> >>>> >>>>>>> > >> > > > > plan?
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > - If the current pricing plan (either free or paid)
> >>>> >>>>>>> > >> > > > > can't provide stable build capacity, can we upgrade to
> >>>> >>>>>>> > >> > > > > a higher priced plan with larger and more stable build
> >>>> >>>>>>> > >> > > > > capacity?
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > BTW, another factor that contributes to the
> >>>> >>>>>>> > >> > > > > productivity problem is that our build is slow: we run
> >>>> >>>>>>> > >> > > > > a full build for every PR, and a successful full build
> >>>> >>>>>>> > >> > > > > takes ~5h. We definitely have more options to solve
> >>>> >>>>>>> > >> > > > > it, for instance, modularizing the build graph and
> >>>> >>>>>>> > >> > > > > reusing artifacts from the previous build. But I think
> >>>> >>>>>>> > >> > > > > that can be a big effort which is much harder to
> >>>> >>>>>>> > >> > > > > accomplish in a short period of time and may deserve
> >>>> >>>>>>> > >> > > > > its own separate discussion.
> >>>> >>>>>>> > >> > > > >
> >>>> >>>>>>> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>> >>>>>>>
> >>>> >>>>>>> --
> >>>> >>>>>>> Best Regards
> >>>> >>>>>>>
> >>>> >>>>>>> Jeff Zhang