Thanks for your feedback. I will then go for option B.

On Fri, Dec 13, 2019 at 2:51 PM Till Rohrmann <trohrm...@apache.org> wrote:
> Thanks for starting this discussion Robert.
>
> I can see benefits for both options, as already mentioned in this thread. However, given that we already have the profile splits and that it would considerably decrease feedback for developers on their personal Azure accounts, I'd be in favour of option B for the time being. If we see that we can keep the build time for the local Azure setups down differently, then one could start simplifying the build.
>
> Cheers,
> Till
>
> On Fri, Dec 13, 2019 at 2:42 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>
>> It's a tough question. On the one hand I like less complexity in the build system. But one of the most important things for developers is fast iteration cycles.
>>
>> So I would prefer the solution that keeps the iteration time low.
>>
>> Best,
>> Aljoscha
>>
>> > On 13. Dec 2019, at 14:41, Chesnay Schepler <ches...@apache.org> wrote:
>> >
>> > It depends on how you define "split"; if you split by module (as we do currently) you have the same complexity as we have right now: caching of artifacts and a brittle definition of splits.
>> >
>> > But there are other ways to split builds, for example into unit and integration tests; we could also add end-to-end tests to this list. At that point we're basically talking about multiple parallel builds that are fully independent. Let's also remember that caching of the build artifact is only useful when the compile times are large enough to warrant it; if we only go with 2 splits, in the grand scheme of things the caching wouldn't even be required. We added the caching to Travis since at 5+ builds (and the guarantee for this number to go up) the compilation time was a much larger factor.
>> >
>> > As for the current split setup we have (as in, by modules), it isn't just about faster feedback times; the splits can also be used to isolate components from each other. I know that quite a few people appreciate the kafka/python module being in its own split, for example.
>> >
>> > On 11/12/2019 16:44, Robert Metzger wrote:
>> >> Some comments on Chesnay's message:
>> >> - Changing the number of splits will not reduce the complexity.
>> >> - One can also use the Flink build machines by opening a PR to the "flink-ci/flink" repo, no need to open crappy PRs :)
>> >> - On the number of builds being run: We currently use 4 out of 10 machines offered by Alibaba, and we are not yet hitting any limits. In addition to that, another big cloud provider has reached out to us, offering build capacity.
>> >>
>> >> But generally, I agree that solely relying on the build infrastructure of Flink is not a good option. The free Azure builds should provide a reasonable experience.
>> >>
>> >> On Wed, Dec 11, 2019 at 3:22 PM Chesnay Schepler <ches...@apache.org> wrote:
>> >>
>> >>> Note that for B it's not strictly necessary to maintain the current number of splits; 2 might already be enough to bring contributor build times down to a more reasonable level.
>> >>>
>> >>> I don't think that a contributor build taking 3.5h is a viable option; people will start disregarding their own instance and just open a PR without having run the tests, which will naturally mean that PR quality will drop. Committers will probably start working around this and push branches into the flink repo for running tests; we have seen that in the past and see this currently for e2e tests.
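For the unit / integration test split idea above, a rough sketch of what two fully independent Azure jobs could look like. This is only an illustration: the VM image is a placeholder, the "run-its-only" profile is a hypothetical name, and it assumes integration tests are bound to the failsafe plugin (so that -DskipITs applies), which does not necessarily match Flink's actual test setup.

# Sketch only: two independent jobs, each compiling on its own,
# so no build artifact has to be passed between them.
jobs:
- job: unit_tests
  pool:
    vmImage: 'ubuntu-latest'   # placeholder image
  steps:
  # Compile and run the surefire unit tests; -DskipITs skips the
  # failsafe integration tests (assumption: ITs are bound to failsafe).
  - script: mvn -B clean verify -DskipITs
    displayName: 'Compile and run unit tests'

- job: integration_tests
  pool:
    vmImage: 'ubuntu-latest'   # placeholder image
  steps:
  # 'run-its-only' is a hypothetical profile that would skip unit tests
  # and run only the integration tests.
  - script: mvn -B clean verify -Prun-its-only
    displayName: 'Compile and run integration tests'

With only two such splits, each job pays the full compile time once, which matches the point above that the artifact caching would not even be required.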
>> >>> This will increase the number of builds being run on the Flink machines by quite a bit, obviously affecting throughput and latency.
>> >>>
>> >>> On 11/12/2019 14:59, Arvid Heise wrote:
>> >>>> Hi Robert,
>> >>>>
>> >>>> Thank you very much for raising this issue and improving the build system. For now, I'd like to stick to a lean solution (= option A).
>> >>>>
>> >>>> While option B can greatly reduce build times, it also has the habit of clogging up the build machines. Just some arbitrary numbers, but it currently feels like B cuts latency in half but also uses 10 machines for 30 minutes, decreasing the overall throughput significantly. Thus, when many folks want to see their commits tested, resources quickly run out, and this in turn significantly increases latency. I'd like to have more predictable build times and sacrifice some latency for now.
>> >>>>
>> >>>> It would be interesting to see if we could rearrange the project execution order in Maven, such that fast projects are executed first. E2E tests should be executed last, which they somewhat already are because of the project dependencies. Of course, I'm very interested in improving the overall build experience by exploring alternatives to Maven.
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Arvid
>> >>>>
>> >>>> On Wed, Dec 11, 2019 at 2:32 PM Robert Metzger <rmetz...@apache.org> wrote:
>> >>>>> Hey devs,
>> >>>>>
>> >>>>> I need your opinion on something: as part of our migration from Travis to Azure, I'm revisiting the build system of Flink. I currently see two different ways of proceeding, and I would like to know your opinion on the two options.
>> >>>>>
>> >>>>> A) We build and test Flink in one "mvn clean verify" call on the CI system.
>> >>>>> B) We migrate the two-staged build of one compile job and N test jobs to Azure.
>> >>>>>
>> >>>>> Option A) is what we are currently running as part of testing the Azure-based system.
>> >>>>>
>> >>>>> Pros/cons for A)
>> >>>>> + For "apache/flink" pushes and pull requests, the big testing machines need 1:30 hours to complete (this might go up by a few minutes because the python tests and some auxiliary tests are not executed yet).
>> >>>>> + Our build will be easier to maintain and understand, because we rely on fewer scripts.
>> >>>>> - Builds on Flink forks using the free Azure plan currently take 3:30 hours to complete.
>> >>>>>
>> >>>>> Pros/cons for B)
>> >>>>> + Builds on Flink forks using the free Azure plan take 1:20 hours.
>> >>>>> + Builds take 1:20 hours on the big testing machines.
>> >>>>> - Maintenance and complexity of the build scripts.
>> >>>>> - The build times are a lot less predictable, because they depend on the availability of workers. The free plan builds are currently fast because the test stage has 10 jobs and Azure offers 10 parallel workers. We currently only have a total of 8 big machines, so there will always be some queueing. In practice, for the "apache/flink" repo, build times will be less favorable because of the scheduling.
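To make option B more concrete, here is a rough sketch of a two-staged pipeline: one compile job publishes its output as a pipeline artifact, and a matrix of test jobs downloads it. The module groups, the published path and the test commands are made up for illustration and are not the actual flink-ci configuration.

# Sketch only: compile once, then fan out into N test jobs.
stages:
- stage: compile
  jobs:
  - job: build
    pool:
      vmImage: 'ubuntu-latest'        # placeholder image
    steps:
    - script: mvn -B clean install -DskipTests
      displayName: 'Compile Flink'
    # Publish the compiled working directory so the test jobs can
    # reuse it without compiling again.
    - publish: $(Build.SourcesDirectory)
      artifact: flink-build

- stage: test
  dependsOn: compile
  jobs:
  - job: test_split
    pool:
      vmImage: 'ubuntu-latest'        # placeholder image
    strategy:
      matrix:                          # hypothetical module groups
        core:
          MODULES: 'flink-core,flink-runtime'
        connectors:
          MODULES: 'flink-connectors'
    steps:
    - download: current
      artifact: flink-build
    # Run only this split's modules against the pre-compiled build.
    - script: cd $(Pipeline.Workspace)/flink-build && mvn -B verify -pl $(MODULES)
      displayName: 'Run tests for one split'

Deciding what exactly has to be carried over between the stages (the target/ directories, the local Maven repository, or both) and how to group the modules is precisely the "caching of artifacts and brittle definition of splits" complexity mentioned earlier in the thread.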
>> >>>>> In my opinion, the question is mostly: are you okay with waiting 3.5 hours for a build to finish on your private CI, in favor of a less complex build system? Ideally, we'll be able to reduce these 3.5 hours by using a more modern build tool ("gradle") in the future.
>> >>>>>
>> >>>>> I'm happy to hear your thoughts!
>> >>>>>
>> >>>>> Best,
>> >>>>> Robert
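For comparison, option A boils down to a single job with a single Maven call, which is what the "less complex build system" above refers to. A minimal sketch, with the trigger branch and VM image as placeholders:

# Sketch only: one job, one Maven invocation, no stages,
# no artifact passing and no split definitions to maintain.
trigger:
- master
pool:
  vmImage: 'ubuntu-latest'   # placeholder image
steps:
- script: mvn -B clean verify
  displayName: 'Compile and test Flink in a single run'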