Just to elaborate a bit more on why a slow build is OK but having no build resources is not: say I submit a build request at 9am PST, no other requests exist, and mine is at the head of the queue. Currently, that request still won't get built until 4 or 5pm.

> On Jun 26, 2019, at 12:28, Bowen Li <bowenl...@gmail.com> wrote:
>
> Hi,
>
> @Dawid, I think the "long test running" that I mentioned in the first email, and as you guys also said, belongs to "a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion". Thus I didn't include it in what we can do in the foreseeable short term.
>
> Besides, I don't think that's the ultimate reason for the lack of build resources. Even if the build is shortened to something like 2h, the problem I described, where no build machine works for 6 or more hours during PST daytime, will still happen, because no machine from ASF INFRA's pool is allocated to Flink. I have paid close attention to the build queue over the past few weekdays, and it's a pretty clear pattern now.
>
> **The ultimate root cause** for that is that we don't have any **dedicated** build resources that we can stably rely on. I'm actually OK with waiting a long time if there are build requests running; it means we are at least making progress. But I'm not OK with having no build resources. A better place I think we should aim at in the short term is to always have at least a central pool (can be 3 or 5) of machines dedicated to building Flink at any time, or maybe to use users' resources.
>
> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is using a Jenkins job to automatically build on users' Travis accounts and link the result back to the GitHub PR. I guess the Jenkins job would fetch the latest upstream master and build the PR against it. Jeff has filed tickets to learn about and get access to the Jenkins infra. It'll be better to fully understand it first before judging this approach.
>
> I also heard good things about CircleCI, and ASF INFRA seems to have a pool of build capacity there too. It can be an alternative to consider.
>
>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>> Sorry to jump in late, but I think Bowen missed the most important point from Chesnay's previous message in the summary. The ultimate reason for all the problems is that the tests already take close to 2 hours to run. I fully support this claim: "Unless people start caring about test times before adding them, this issue cannot be solved"
>>
>> This is also another reason why using a user's Travis account won't help. Every few weeks we reach the user's time limit for a single profile. This makes the user's builds simply fail, until we either properly decrease the time the tests take (which I am not sure we ever did) or postpone the problem by splitting into more profiles. (Note that the ASF Travis account has higher time limits.)
>>
>> Best,
>>
>> Dawid
>>
>> On 26/06/2019 09:36, Robert Metzger wrote:
>> > Do we know if using "the best" available hardware would improve the build times?
>> > Imagine we ran the build on machines with plenty of main memory to mount everything to ramdisk + the latest CPU architecture?
>> >
>> > Throwing hardware at the problem could help reduce the time of an individual build, and using our own infrastructure would remove our dependency on Apache's Travis account (with the obvious downside of having to maintain the infrastructure).
>> > We could use an open source Travis alternative, to have a similar experience and make the migration easy.
>> >
>> >
>> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org> wrote:
>> >
>> >> From what I gathered, there's no special sauce that the Zeppelin project uses which actually integrates a user's Travis account into the PR.
>> >>
>> >> They just disabled Travis for PRs. And that's kind of it.
>> >>
>> >> Naturally we can do this (duh) and save the ASF a fair amount of resources, but there are downsides:
>> >>
>> >> The discoverability of the Travis check takes a nose-dive. Either we require every contributor to always, on every commit, also post a Travis build, or we have the reviewer sift through the contributor's account to find it.
>> >>
>> >> This is rather cumbersome. Additionally, it's also not equivalent to having a PR build.
>> >>
>> >> A normal branch build takes a branch as is and tests it. A PR build merges the branch into master, and then runs it. (Fun fact: This is why a PR with merge conflicts is not being run on Travis.)
>> >>
>> >> And ultimately, everyone can already make use of this approach anyway.
>> >>
>> >> On 25/06/2019 08:02, Jark Wu wrote:
>> >>> Hi Jeff,
>> >>>
>> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to leverage users' Travis accounts.
>> >>> In this way, we can have almost unlimited concurrent build jobs, and developers can restart builds by themselves (currently only committers can restart a PR's build).
>> >>>
>> >>> But I'm still not very clear on how to integrate a user's Travis build into the Flink pull request's build automatically. Can you explain in more detail?
>> >>>
>> >>> Another question: does Travis only build branches for user accounts?
>> >>> My concern is that builds for PRs will rebase the user's commits against the current master branch.
>> >>> This helps us find problems before merging. Builds for branches will lose the impact of new commits in master.
>> >>> How does Zeppelin solve this problem?
>> >>>
>> >>> Thanks again for sharing the idea.
>> >>>
>> >>> Regards,
>> >>> Jark
>> >>>
>> >>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:
>> >>>
>> >>> Hi Folks,
>> >>>
>> >>> Zeppelin met this kind of issue before; we solved it by delegating each contributor's PR build to their own Travis account (everyone can have 5 free slots for Travis builds).
>> >>> The Apache account's Travis build is only triggered when a PR is merged.
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
>> >>>
>> >>> > (Forgot to cc George)
>> >>> >
>> >>> > Best,
>> >>> > Kurt
>> >>> >
>> >>> >
>> >>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
>> >>> >
>> >>> > > Hi Bowen,
>> >>> > >
>> >>> > > Thanks for bringing this up. We have actually discussed this, and I think Till and George have already spent some time investigating it. I have cc'ed both of them, and maybe they can share their findings.
>> >>> > >
>> >>> > > Best,
>> >>> > > Kurt
>> >>> > >
>> >>> > >
>> >>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
>> >>> > >
>> >>> > >> Hi Bowen,
>> >>> > >>
>> >>> > >> Thanks for bringing this up. We also suffered from the long build time.
>> >>> > >> I agree that we should focus on solving the build capacity problem in this thread.
>> >>> > >>
>> >>> > >> My observation is that only one build is running, and all the others (other PRs, master) are pending.
>> >>> > >> The Travis pricing plan [1] shows it can support concurrent build jobs.
>> >>> > >> But I don't know which plan we are using; it might be the free plan for open source.
>> >>> > >>
>> >>> > >> I cc-ed Chesnay, who may have some experience with Travis.
>> >>> > >>
>> >>> > >> Regards,
>> >>> > >> Jark
>> >>> > >>
>> >>> > >> [1]: https://travis-ci.com/plans
>> >>> > >>
>> >>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
>> >>> > >>
>> >>> > >> > Hi Steven,
>> >>> > >> >
>> >>> > >> > I think you may not have read what I wrote. The discussion is about "unstable build **capacity**", in other words "unstable / lack of build resources", not "unstable build".
>> >>> > >> >
>> >>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
>> >>> > >> >
>> >>> > >> > > A long and sometimes unstable build is definitely a pain point.
>> >>> > >> > >
>> >>> > >> > > I suspect the build failure here in flink-connector-kafka is not related to my change, but there is no easy way to re-run the build in the Travis UI. A Google search showed a trick: closing and reopening the PR will trigger a rebuild. But that could add noise to the PR activity.
>> >>> > >> > > https://travis-ci.org/apache/flink/jobs/545555519
>> >>> > >> > >
>> >>> > >> > > travis-ci for my personal repo often failed by exceeding the time limit after 4+ hours:
>> >>> > >> > > The job exceeded the maximum time limit for jobs, and has been terminated.
>> >>> > >> > >
>> >>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
>> >>> > >> > >
>> >>> > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build request has been sitting at **HEAD of the queue** since I first saw it at PST 10:30am (not sure how long it's been there before 10:30am).
>> >>> > >> > > > It's PST 4:12pm now and it hasn't started yet.
>> >>> > >> > > >
>> >>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
>> >>> > >> > > >
>> >>> > >> > > > > Hi devs,
>> >>> > >> > > > >
>> >>> > >> > > > > I've been experiencing the pain resulting from the lack of stable build capacity on Travis for Flink PRs [1]. Specifically, I have often noticed that no build in the queue makes any progress for hours, and then suddenly 5 or 6 builds kick off all together after the long pause. I'm in the PST (UTC-08) time zone, and I've seen the pause last as long as 6 hours, from PST 9am to 3pm (let alone the time needed to drain the queue afterwards).
>> >>> > >> > > > >
>> >>> > >> > > > > I think this has greatly impacted our productivity. I've experienced that PRs submitted in the early morning of the PST time zone won't finish their build until late at night the same day.
>> >>> > >> > > > >
>> >>> > >> > > > > So my questions are:
>> >>> > >> > > > >
>> >>> > >> > > > > - Has anyone else experienced the same problem or made a similar observation on TravisCI? (I suspect it has something to do with time zone)
>> >>> > >> > > > >
>> >>> > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it the free plan for open source projects? What is the guaranteed build capacity of the current plan?
>> >>> > >> > > > >
>> >>> > >> > > > > - If the current pricing plan (either free or paid) can't provide stable build capacity, can we upgrade to a higher priced plan with larger and more stable build capacity?
>> >>> > >> > > > >
>> >>> > >> > > > > BTW, another factor that contributes to the productivity problem is that our build is slow: we run a full build for every PR, and a successful full build takes ~5h. We definitely have more options to solve it, for instance, modularizing the build graphs and reusing artifacts from the previous build. But I think that can be a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion.
>> >>> > >> > > > >
>> >>> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>> >>> > >> > > > >
>> >>>
>> >>> --
>> >>> Best Regards
>> >>>
>> >>> Jeff Zhang