Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Bowen Li
Re: > Are they using their own Travis CI pool, or did they switch to an entirely different CI service? I reached out to Wes and Krisztián from Apache Arrow PMC. They are currently moving away from ASF's Travis to their own in-house metal machines at [1] with custom CI application at [2]. They've se

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Chesnay Schepler
Are they using their own Travis CI pool, or did they switch to an entirely different CI service? If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly? On 03/07/2019 05:55, Bowen Li wrote: I responded in the INFRA ticket [1]

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Bowen Li
I responded in the INFRA ticket [1] that I believe they are using a wrong metric against Flink and the total build time is a completely different thing than guaranteed build capacity. My response: "As mentioned above, since I started to pay attention to Flink's build queue a few tens of days ago,

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
As a short-term stopgap, since we can assume this issue to become much worse in the following days/weeks, we could disable IT cases in PRs and only run them on master. On 02/07/2019 12:03, Chesnay Schepler wrote: People really have to stop thinking that just because something works for us it i

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
People really have to stop thinking that just because something works for us it is also a good solution. Also, please remember that our builds run for 2h from start to finish, and not the 14 _minutes_ it takes for zeppelin. We are dealing with an entirely different scale here, both in terms of b

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-01 Thread Bowen Li
By looking at the git history of the Jenkins script, its core part was finished in March 2017 (with only two minor updates in 2017/2018), so it's been running for over two years now and it feels like the Zeppelin community has been quite happy with it. @Jeff Zhang can you share your insights and user experi

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Chesnay Schepler
So yes, the Jenkins job keeps pulling the state from Travis until it finishes. Not sure I'm comfortable with the idea of using Jenkins workers just to idle for several hours. On 29/06/2019 14:56, Jeff Zhang wrote: Here's what zeppelin community did, we make a python script to check the bu
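The polling behaviour described in this message can be sketched roughly as follows. This is not the Zeppelin `travis_check.py` script; it is a minimal illustration that assumes a hypothetical `fetch_state` callable returning the current build state as a string, and the state names, polling interval, and poll budget are all assumptions chosen for the example. It shows why the calling job stays occupied for as long as the Travis build runs.

```python
import time
from typing import Callable

# State names treated as final here are assumptions for illustration.
TERMINAL_STATES = {"passed", "failed", "errored", "canceled"}


def wait_for_build(
    fetch_state: Callable[[], str],
    interval_s: float = 60.0,
    max_polls: int = 180,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a build until it reaches a terminal state or the budget runs out.

    fetch_state is a hypothetical callable supplied by the caller; it
    stands in for whatever request reports the build status.
    """
    for attempt in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # The worker does nothing useful here; it only waits for the next poll.
        if attempt < max_polls - 1:
            sleep(interval_s)
    return "timeout"
```

With a 60-second interval and a 2-hour build, a worker running this loop would spend nearly the whole time sleeping, which is the concern raised above.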

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Jeff Zhang
Here's what zeppelin community did, we make a python script to check the build status of pull request. Here's script: https://github.com/apache/zeppelin/blob/master/travis_check.py And this is the script we used in Jenkins build job. if [ -f "travis_check.py" ]; then git log -n 1 STATUS=$(cur

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Chesnay Schepler
Does this imply that a Jenkins job is active as long as the Travis build runs? On 26/06/2019 21:28, Bowen Li wrote: Hi, @Dawid, I think the "long test running" as I mentioned in the first email, also as you guys said, belongs to "a big effort which is much harder to accomplish in a short perio

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-27 Thread Chesnay Schepler
See https://issues.apache.org/jira/browse/INFRA-18533 for the overall degradation of Travis capacity. On 26/06/2019 21:50, Bowen wrote: Just to elaborate a bit more on why a slow build is OK but no resources is not: say I submit a build request at 9am PST, no other requests exist, and mine is the qu

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Bowen
Just to elaborate a bit more on why a slow build is OK but no resources is not: say I submit a build request at 9am PST, no other requests exist, and mine is the queue head; currently that means it still cannot get built until 4 or 5pm. > On Jun 26, 2019, at 12:28, Bowen Li wrote: > > Hi, > > @Dawid
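The distinction drawn here, between time spent waiting for a build slot and time spent actually building, can be made concrete with a small calculation. The sketch below uses the figures mentioned in this thread (a request submitted at 9am, a build that takes about 2 hours) and assumes a 4pm start, the earlier of the two times given; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def split_turnaround(submitted: datetime, started: datetime, finished: datetime):
    """Return (queue wait, run time) for a single build request."""
    queue_wait = started - submitted  # time with no build capacity available
    run_time = finished - started     # time the build itself takes
    return queue_wait, run_time


# Figures from the thread; the 4pm start is an assumption ("4 or 5pm").
submitted = datetime(2019, 6, 26, 9, 0)
started = datetime(2019, 6, 26, 16, 0)
finished = started + timedelta(hours=2)

queue_wait, run_time = split_turnaround(submitted, started, finished)
print(queue_wait, run_time)  # prints "7:00:00 2:00:00"
```

Under these assumptions the request waits 7 hours for capacity and then runs for 2, so speeding up the tests alone would shorten only the smaller part of the turnaround.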

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Bowen Li
Hi, @Dawid, I think the "long test running" as I mentioned in the first email, also as you guys said, belongs to "a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion". Thus I didn't include it in what we can do in a foreseeable shor

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Dawid Wysakowicz
Sorry to jump in late, but I think Bowen missed the most important point from Chesnay's previous message in the summary. The ultimate reason for all the problems is that the tests take close to 2 hours to run already. I fully support this claim: "Unless people start caring about test times before a

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Robert Metzger
Do we know if using "the best" available hardware would improve the build times? Imagine we would run the build on machines with plenty of main memory to mount everything to ramdisk + the latest CPU architecture? Throwing hardware at the problem could help reduce the time of an individual build, a

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Chesnay Schepler
From what I gathered, there's no special sauce that the Zeppelin project uses which actually integrates a user's Travis account into the PR. They just disabled Travis for PRs. And that's kind of it. Naturally we can do this (duh) and save the ASF a fair amount of resources, but there are downsi

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-25 Thread Bowen Li
Want to summarize Chesnay's points for everyone reading this thread: 1) the build resources Flink is currently using belong to ASF INFRA, and 2) we are waiting on ASF INFRA's response on whether we can donate/sponsor extra build resources for Flink. I think it'll be super helpful to pay and secure

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-25 Thread Chesnay Schepler
On 24/06/2019 23:48, Bowen Li wrote: - Has anyone else experienced the same problem or have similar observation on TravisCI? (I suspect it has things to do with time zone) In Europe we have the same problem. - What pricing plan of TravisCI is Flink currently using? Is it the free plan for op

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Jark Wu
Hi Jeff, Thanks for sharing the Zeppelin approach. I think it's a good idea to leverage user's travis account. In this way, we can have almost unlimited concurrent build jobs and developers can restart build by themselves (currently only committers can restart PR's build). But I'm still not very

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Jeff Zhang
Hi Folks, Zeppelin met this kind of issue before; we solved it by delegating each person's PR build to their own Travis account (everyone can have 5 free slots for Travis builds). The Apache account's Travis build is only triggered when a PR is merged. Kurt Young wrote on Tue, Jun 25, 2019 at 10:16 AM: > (Forgot to cc George)

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Kurt Young
(Forgot to cc George) Best, Kurt On Tue, Jun 25, 2019 at 10:16 AM Kurt Young wrote: > Hi Bowen, > > Thanks for bringing this up. We actually have discussed this, and I > think Till and George have > already spent some time investigating it. I have cced both of them, and > maybe they can s

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Kurt Young
Hi Bowen, Thanks for bringing this up. We actually have discussed this, and I think Till and George have already spent some time investigating it. I have cced both of them, and maybe they can share their findings. Best, Kurt On Tue, Jun 25, 2019 at 10:08 AM Jark Wu wrote: > Hi Bowen, > >

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Jark Wu
Hi Bowen, Thanks for bringing this up. We also suffered from the long build time. I agree that we should focus on solving the build capacity problem in this thread. My observation is that only one build is running, and all the others (other PRs, master) are pending. The pricing plan[1] of travis shows it

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Bowen Li
Hi Steven, I think you may not have read what I wrote. The discussion is about "unstable build **capacity**", in other words "unstable / lack of build resources", not "unstable build". On Mon, Jun 24, 2019 at 4:40 PM Steven Wu wrote: > long and sometimes unstable build is definitely a pain point. >

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Steven Wu
Long and sometimes unstable builds are definitely a pain point. I suspect the build failure here in flink-connector-kafka is not related to my change, but there is no easy way to re-run the build on the Travis UI. A Google search showed a trick: closing and reopening the PR will trigger a rebuild. But that could add no

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Bowen Li
https://travis-ci.org/apache/flink/builds/549681530 This build request has been sitting at **HEAD of the queue** since I first saw it at PST 10:30am (not sure how long it's been there before 10:30am). It's PST 4:12pm now and it hasn't started yet. On Mon, Jun 24, 2019 at 2:48 PM Bowen Li wrote:

[DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Bowen Li
Hi devs, I've been experiencing the pain resulting from lack of stable build capacity on Travis for Flink PRs [1]. Specifically, I noticed often that no build in the queue is making any progress for hours, and suddenly 5 or 6 builds kick off all together after the long pause. I'm at PST (UTC-08) t