Thanks, Robert, for the update! And thanks a lot for all the effort to
investigate, experiment with, and tune Azure Pipelines for building Flink.
Big +1 for it.

It would be great if the community build capacity could be extended with
custom machines, so that tests are not queued for long as the number of PRs
grows daily.

The increased timeout would also be very helpful.
The 50-minute timeout for free Travis accounts is currently a pain,
especially when we'd like to run e2e tests in our own Travis accounts. I had
to split the jobs manually to make them pass.

Thanks,
Zhu Zhu

Robert Metzger <rmetz...@apache.org> wrote on Wed, Dec 4, 2019 at 6:36 PM:

> Hi all,
>
> as a follow up from our discussion on reducing the build time [1], I would
> like to propose migrating our build infrastructure to Azure Pipelines (away
> from Travis).
>
> I believe that we have reached the limits of what Travis can provide the
> Flink community, and I don't want the build system to limit or influence
> the project's growth.
>
> *Benefits:*
> 1. Free Travis accounts are limited to 5 parallel builds, with a timeout
> of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts*
> for free for open source projects.
> 2. Azure Pipelines allows us to *add custom build machines* to the pool of
> 10 free parallel builders.
> This will allow the Flink community to scale the available build capacity
> as the project grows. We are dependent on donations from supporting
> companies, but I believe that it is easier for companies to donate machines
> than money.
> Alibaba is willing to provide 10 machines with 32 cores each to the Flink
> project for this purpose.
> In addition, Xiyuan, who's working on adding ARM support for Flink, provided
> me with 2 ARM machines (16 cores each).
> I want to use the custom, more efficient build machines for building
> Flink's pull requests and master-pushes.
> 3. *Azure Pipelines is a more feature-rich tool*, allowing us, for example,
> to transfer intermediate build artifacts between pipeline stages. This will
> allow us to make the build more reliable (we are currently abusing the
> caching mechanism in Travis for this).
> It also has some basic analytics on test results / flaky tests etc.
>
> *Known problems:*
> - Initially, we might see different build instabilities than before
> - There's a higher maintenance overhead for the custom build machines
> (keeping them up to date etc.)
> - We cannot use the build status integration of AZP, because it requires
> write access to the repository's source. The foundation does not allow that
> [2].
> I propose to extend flinkbot / the flink-ci repository instead.
>
> *Current Status:*
> - I'm able [3] to execute [4] the current custom build scripts on Azure
> Pipelines: This means that we will have one compile stage, and N testing
> jobs in the 2nd stage. Currently, we have N=10 testing jobs.
> The time from the start of a build until all tests have completed is
> 1h 22min.
> - I'm working on getting the nightly end to end tests to run on the new
> infrastructure.
> - I'm working on getting the build to work on our pool of custom machines
> as well.
> - I'm working on setting up the full matrix of builds (different Scala,
> Hadoop etc. versions) for the nightlies.
>
> *Next Steps:*
> - I propose to document the entire build system in the Flink Wiki
> - Once Azure can cover the same pull request tests as Travis, I would set
> it up to run in parallel (including Flinkbot posting links to Azure). I
> hope that this phase lasts for 1-2 weeks only, so that we do not have to
> maintain things concurrently. I will monitor the build stability closely,
> but would expect some support with debugging potential issues from the
> contributors.
> - Once there are no problems with the new setup, we remove the Travis
> setup.
> - Independently, I will work on triggering builds from master /
> release-branch pushes, as well as cron builds from the master branch. All
> of this will be described in the Wiki.
>
>
> *Timeline:*
> - Once I have the feeling that people are supportive of the idea, I will
> start documenting in the Wiki. The first pull requests should show up
> after a few more days.
> I will take one month of parental leave starting sometime later in
> December, which will probably delay things a bit. I hope to have
> everything finished by the end of January.
>
> I'm happy to hear your thoughts on this work.
> If nobody objects, I will start documenting the system and prepare
> everything for the migration.
>
> Best,
> Robert
>
>
>
> [1]
> https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
> [2] https://issues.apache.org/jira/browse/INFRA-17030
> [3] https://github.com/rmetzger/flink/tree/azure_playground
> [4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
>
