Thanks, Robert, for driving this. Another big pain point with the current Travis setup is that its cache mechanism fails from time to time. Around 50% of the build failures are caused by cache problems. I reported this issue to Travis but have not received a response yet. So a big +1 from my side.
Just one comment: we are close to the 1.10 feature freeze, and we will spend some time stabilizing the tests before the release. I would prefer that this replacement happen after the 1.10 release; otherwise it will be a source of instability during release testing.

Best,
Kurt

On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <reed...@gmail.com> wrote:

> Thanks Robert for the updates! And thanks a lot for all the efforts to
> investigate, experiment and tune Azure Pipelines for Flink building.
> Big +1 for it.
>
> It would be great if the community build capacity could be extended with
> custom machines, so that tests are not queued for long as the number of
> PRs grows daily.
>
> The increased timeout would also be very helpful.
> The 50-minute timeout for free Travis accounts is currently a pain,
> especially when we'd like to run e2e tests in our own Travis. I had to
> manually split the jobs to make them pass.
>
> Thanks,
> Zhu Zhu
>
> Robert Metzger <rmetz...@apache.org> wrote on Wed, Dec 4, 2019 at 6:36 PM:
>
> > Hi all,
> >
> > as a follow up from our discussion on reducing the build time [1], I
> > would like to propose migrating our build infrastructure to Azure
> > Pipelines (away from Travis).
> >
> > I believe that we have reached the limits of what Travis can provide the
> > Flink community, and I don't want the build system to limit or influence
> > the project's growth.
> >
> > *Benefits:*
> > 1. The free Travis accounts are limited to 5 parallel builds, with a
> > timeout of 50 minutes. Azure offers *10 parallel builds with 300 minute
> > timeouts* for free for open source projects.
> > 2. Azure Pipelines allows us to *add custom build machines* to the pool
> > of 10 free parallel builders.
> > This will allow the Flink community to scale the available build capacity
> > as the project grows. We are dependent on donations from supporting
> > companies, but I believe that it is easier for companies to donate
> > machines than money.
> > Alibaba is willing to provide 10 machines, with 32 cores each, to the
> > Flink project for this purpose.
> > In addition, Xiyuan, who's working on adding ARM support for Flink,
> > provided me with 2 ARM machines (16 cores each).
> > I want to use the custom, more efficient build machines for building
> > Flink's pull requests and master-pushes.
> > 3. *Azure Pipelines is a more feature-rich tool*, allowing for example to
> > transfer intermediate build artifacts between pipeline stages. This will
> > allow us to make the build more reliable (we are currently abusing the
> > caching mechanism in Travis for this).
> > It also has some basic analytics on test results / flaky tests etc.
> >
> > *Known problems:*
> > - Initially, we might see different build instabilities than before
> > - There's a higher maintenance overhead for the custom build machines
> > (keeping them up to date etc.)
> > - We cannot use the build status integration of AZP, because it
> > requires write access to the repository's source. The foundation does
> > not allow that [2].
> > I propose to extend flinkbot / the flink-ci repository.
> >
> > *Current Status:*
> > - I'm able [3] to execute [4] the current custom build scripts on Azure
> > Pipelines: This means that we will have one compile stage, and N testing
> > jobs in the 2nd stage. Currently, we have N=10 testing jobs.
> > The time from the start of a build till all tests have completed is
> > 1 hour 22 minutes.
> > - I'm working on getting the nightly end to end tests to run on the new
> > infrastructure.
> > - I'm working on getting the build to work on our pool of custom machines
> > as well
> > - I'm working on setting up the full matrix of builds (different Scala,
> > Hadoop etc.
> > versions) for the nightlies
> >
> > *Next Steps:*
> > - I propose to document the entire build system in the Flink Wiki
> > - Once Azure can cover the same pull request tests as Travis, I would set
> > it up to run in parallel (including Flinkbot posting links to Azure). I
> > hope that this phase lasts for 1-2 weeks only, so that we do not have to
> > maintain things concurrently. I will monitor the build stability closely,
> > but would expect some support with debugging potential issues from the
> > contributors.
> > - Once there are no problems with the new setup, we remove the Travis
> > setup.
> > - Independently, I will work on triggering builds from master /
> > release-branch pushes, as well as cron builds from the master branch ...
> > all this will be described in the Wiki.
> >
> > *Timeline:*
> > - Once I have the feeling that people are supportive of the idea, I will
> > start documenting in the Wiki. The first pull requests should show up
> > after a few more days.
> > I will go on a one-month parental leave starting some time later in
> > December, which will probably delay things a bit. I hope to have
> > everything finished by end of January.
> >
> > I'm happy to hear your thoughts on this work.
> > If nobody objects, I will start documenting the system and prepare
> > everything for the migration.
> >
> > Best,
> > Robert
> >
> > [1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
> > [2] https://issues.apache.org/jira/browse/INFRA-17030
> > [3] https://github.com/rmetzger/flink/tree/azure_playground
> > [4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary