Judging by the git history of the Jenkins script, its core part was finished in March 2017 (with only two minor updates in 2017/2018), so it has been running for over two years now, and it seems the Zeppelin community has been quite happy with it. @Jeff Zhang <zjf...@gmail.com> can you share your insights and user experience with the Jenkins+Travis approach?
Things like:

- has the approach completely solved the resource capacity problem for the Zeppelin community? Is the Zeppelin community happy with the result?
- is the whole configuration chain stable enough (e.g. in terms of uptime)?
- how often do you need to maintain the Jenkins infra? How many people are usually involved in maintenance and bug fixes?

To me, the downside of this approach seems to be mostly the maintenance: maintaining the script and the Jenkins infra.

** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own travis-ci.com account with paid dedicated resources. Note that travis-ci.org is the free version and travis-ci.com is the commercial version. We currently use a shared resource pool managed by the ASF INFRA team on travis-ci.org, but we have no control over it: we can't see how it's configured, how many resources are available, how resources are allocated among Apache projects, etc.

The nice things about having an account on travis-ci.com are:

- relatively low cost with a much better resource guarantee than what we currently have [1]: $249/month for 5 dedicated concurrent jobs, $489/month for 10 concurrent jobs
- low maintenance work compared to using Jenkins
- (potentially) no migration cost according to Travis's doc [2] (pending verification)
- full control over the build capacity/configuration compared to using ASF INFRA's pool

I'd be surprised if we, as such a vibrant community, cannot find and fund $249*12=$2988 a year in exchange for a much better developer experience and much higher productivity.

[1] https://travis-ci.com/plans
[2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration

On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:

> So yes, the Jenkins job keeps pulling the state from Travis until it finishes.
>
> Not sure I'm comfortable with the idea of using Jenkins workers just to idle for several hours.
>
> On 29/06/2019 14:56, Jeff Zhang wrote:
> > Here's what the Zeppelin community did: we made a Python script to check the build status of a pull request.
> > Here's the script:
> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >
> > And this is the script we use in the Jenkins build job.
> >
> > if [ -f "travis_check.py" ]; then
> >   git log -n 1
> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >   #if [ -z $COMMIT ]; then
> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >   #fi
> >
> >   # get commit hash from PR
> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >   sleep 30 # sleep a few moments to wait for travis to start the build
> >   RET_CODE=0
> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> >     RET_CODE=0
> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >   fi
> >
> >   if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in travis
> >     set +x
> >     echo "-----------------------------------------------------"
> >     echo "Looks like travis-ci is not configured for your fork."
> >     echo "Please setup by switching on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> >     echo ""
> >     echo "To trigger CI after setup, you will need to amend your last commit with"
> >     echo "git commit --amend"
> >     echo "git push your-remote HEAD --force"
> >     echo ""
> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
> >   fi
> >
> >   exit $RET_CODE
> > else
> >   set +x
> >   echo "travis_check.py does not exist"
> >   exit 1
> > fi
> >
> > Chesnay Schepler <ches...@apache.org> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> >
> >> Does this imply that a Jenkins job is active as long as the Travis build runs?
> >>
> >> On 26/06/2019 21:28, Bowen Li wrote:
> >>> Hi,
> >>>
> >>> @Dawid, I think the "long test running" as I mentioned in the first email, also as you guys said, belongs to "a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion". Thus I didn't include it in what we can do in the foreseeable short term.
> >>>
> >>> Besides, I don't think that's the ultimate reason for the lack of build resources. Even if the build is shortened to something like 2h, the problem I described, of no build machine working for 6 or more hours during PST daytime, will still happen, because no machine from ASF INFRA's pool is allocated to Flink. As I have paid close attention to the build queue in the past few weekdays, it's a pretty clear pattern now.
> >>>
> >>> **The ultimate root cause** for that is: we don't have any **dedicated** build resources that we can stably rely on. I'm actually ok to wait for a long time if there are build requests running; it means at least we are making progress. But I'm not ok with no build resources. A better place I think we should aim at in the short term is to always have at least a central pool (can be 3 or 5) of machines dedicated to building Flink at any time, or maybe use users' resources.
> >>>
> >>> @Chesnay @Robert I synced with Jeff offline that the Zeppelin community is using a Jenkins job to automatically build on users' travis accounts and link the result back to the github PR. I guess the Jenkins job would fetch the latest upstream master and build the PR against it. Jeff has filed tickets to learn about and get access to the Jenkins infra. It'll be better to fully understand it first before judging this approach.
> >>>
> >>> I also heard good things about CircleCI, and ASF INFRA seems to have a pool of build capacity there too. It can be an alternative to consider.
> >>>
> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>
> >>>> Sorry to jump in late, but I think Bowen missed the most important point from Chesnay's previous message in the summary. The ultimate reason for all the problems is that the tests take close to 2 hours to run already. I fully support this claim: "Unless people start caring about test times before adding them, this issue cannot be solved"
> >>>>
> >>>> This is also another reason why using the user's Travis account won't help. Every few weeks we reach the user's time limit for a single profile.
> >>>> This makes the user's builds simply fail, until we either properly decrease the time the tests take (which I am not sure we ever did) or postpone the problem by splitting into more profiles. (Note that the ASF Travis account has higher time limits)
> >>>>
> >>>> Best,
> >>>>
> >>>> Dawid
> >>>>
> >>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>>> Do we know if using "the best" available hardware would improve the build times?
> >>>>> Imagine we would run the build on machines with plenty of main memory to mount everything to ramdisk + the latest CPU architecture?
> >>>>>
> >>>>> Throwing hardware at the problem could help reduce the time of an individual build, and using our own infrastructure would remove our dependency on Apache's Travis account (with the obvious downside of having to maintain the infrastructure)
> >>>>> We could use an open source travis alternative, to have a similar experience and make the migration easy.
> >>>>>
> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>>>>> From what I gathered, there's no special sauce that the Zeppelin project uses which actually integrates a user's Travis account into the PR.
> >>>>>> They just disabled Travis for PRs. And that's kind of it.
> >>>>>>
> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of resources, but there are downsides:
> >>>>>>
> >>>>>> The discoverability of the Travis check takes a nose-dive. Either we require every contributor to always, on every commit, also post a Travis build, or we have the reviewer sift through the contributor's account to find it.
> >>>>>>
> >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to having a PR build.
> >>>>>>
> >>>>>> A normal branch build takes a branch as is and tests it. A PR build merges the branch into master, and then runs it. (Fun fact: This is why a PR with merge conflicts is not being run on Travis.)
> >>>>>>
> >>>>>> And ultimately, everyone can already make use of this approach anyway.
> >>>>>>
> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>>>>> Hi Jeff,
> >>>>>>>
> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to leverage users' travis accounts.
> >>>>>>> In this way, we can have almost unlimited concurrent build jobs and developers can restart builds by themselves (currently only committers can restart a PR's build).
> >>>>>>>
> >>>>>>> But I'm still not very clear on how to integrate a user's travis build into the Flink pull request's build automatically. Can you explain in more detail?
> >>>>>>>
> >>>>>>> Another question: does travis only build branches for user accounts?
> >>>>>>> My concern is that builds for PRs will rebase the user's commits against the current master branch.
> >>>>>>> This will help us to find problems before merge. Builds for branches will lose the impact of new commits in master.
> >>>>>>> How does Zeppelin solve this problem?
> >>>>>>>
> >>>>>>> Thanks again for sharing the idea.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Folks,
> >>>>>>>
> >>>>>>> Zeppelin met this kind of issue before; we solved it by delegating each one's PR build to his travis account (everyone can have 5 free slots for travis builds).
> >>>>>>> The Apache account travis build is only triggered when a PR is merged.
> >>>>>>>
> >>>>>>> Kurt Young <ykt...@gmail.com> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>>>>>
> >>>>>>> > (Forgot to cc George)
> >>>>>>> >
> >>>>>>> > Best,
> >>>>>>> > Kurt
> >>>>>>> >
> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
> >>>>>>> >
> >>>>>>> > > Hi Bowen,
> >>>>>>> > >
> >>>>>>> > > Thanks for bringing this up. We actually have discussed this, and I think Till and George have already spent some time investigating it. I have cc'ed both of them, and maybe they can share their findings.
> >>>>>>> > >
> >>>>>>> > > Best,
> >>>>>>> > > Kurt
> >>>>>>> > >
> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>> > >
> >>>>>>> > >> Hi Bowen,
> >>>>>>> > >>
> >>>>>>> > >> Thanks for bringing this up. We also suffered from the long build time.
> >>>>>>> > >> I agree that we should focus on solving the build capacity problem in this thread.
> >>>>>>> > >>
> >>>>>>> > >> My observation is that only one build is running; all the others (other PRs, master) are pending.
> >>>>>>> > >> The pricing plan [1] of travis shows it can support concurrent build jobs.
> >>>>>>> > >> But I don't know which plan we are using; it might be the free plan for open source.
> >>>>>>> > >>
> >>>>>>> > >> I cc-ed Chesnay who may have some experience on Travis.
> >>>>>>> > >>
> >>>>>>> > >> Regards,
> >>>>>>> > >> Jark
> >>>>>>> > >>
> >>>>>>> > >> [1]: https://travis-ci.com/plans
> >>>>>>> > >>
> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>> > >>
> >>>>>>> > >> > Hi Steven,
> >>>>>>> > >> >
> >>>>>>> > >> > I think you may not have read what I wrote. The discussion is about "unstable build **capacity**", in other words "unstable / lack of build resources", not "unstable build".
> >>>>>>> > >> >
> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
> >>>>>>> > >> >
> >>>>>>> > >> > > A long and sometimes unstable build is definitely a pain point.
> >>>>>>> > >> > >
> >>>>>>> > >> > > I suspect the build failure here in flink-connector-kafka is not related to my change, but there is no easy way to re-run the build in the Travis UI. A Google search showed a trick: closing and reopening the PR will trigger a rebuild. But that could add noise to the PR activity.
> >>>>>>> > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>>> > >> > >
> >>>>>>> > >> > > travis-ci for my personal repo often failed after exceeding the time limit at 4+ hours.
> >>>>>>> > >> > > The job exceeded the maximum time limit for jobs, and has been terminated.
> >>>>>>> > >> > >
> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>> > >> > >
> >>>>>>> > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build request has been sitting at **HEAD of the queue** since I first saw it at PST 10:30am (not sure how long it had been there before 10:30am). It's PST 4:12pm now and it hasn't started yet.
> >>>>>>> > >> > > >
> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>> > >> > > >
> >>>>>>> > >> > > > > Hi devs,
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > I've been experiencing the pain resulting from the lack of stable build capacity on Travis for Flink PRs [1]. Specifically, I have often noticed that no build in the queue makes any progress for hours, and then suddenly 5 or 6 builds kick off all together after the long pause. I'm in the PST (UTC-08) time zone, and I've seen that the pause can be as long as 6 hours, from PST 9am to 3pm (let alone the time needed to drain the queue afterwards).
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > I think this has greatly impacted our productivity. I've experienced that PRs submitted in the early morning of the PST time zone won't finish their build until late night of the same day.
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > So my questions are:
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - Has anyone else experienced the same problem or made similar observations on TravisCI? (I suspect it has something to do with the time zone)
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it the free plan for open source projects? What is the guaranteed build capacity of the current plan?
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - If the current pricing plan (either free or paid) can't provide stable build capacity, can we upgrade to a higher priced plan with larger and more stable build capacity?
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > BTW, another factor that contributes to the productivity problem is that our build is slow - we run a full build for every PR and a successful full build takes ~5h. We definitely have more options to solve it, for instance, modularizing the build graph and reusing artifacts from the previous build. But I think that can be a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion.
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>>> > >> > > > >
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best Regards
> >>>>>>>
> >>>>>>> Jeff Zhang