Thanks Robert for driving this. Another big pain point of the current
Travis setup is that its cache mechanism fails from time to time. Around
50% of the build failures are caused by cache problems. I reported this
issue to Travis but have not received a response yet. So a big +1 from my
side.

Just one comment: we are close to the 1.10 feature freeze, and we will
spend some time stabilizing the tests before the release. I would prefer
that this replacement happen after the 1.10 release; otherwise it will be
a source of instability during release testing.

Best,
Kurt


On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <reed...@gmail.com> wrote:

> Thanks Robert for the updates! And thanks a lot for all the efforts to
> investigate, experiment and tune Azure Pipelines for Flink building.
> Big +1 for it.
>
> It would be great if the community build capacity could be extended with
> custom machines, so that tests are not queued for long as the number of PRs
> grows daily.
>
> The increased timeout would also be very helpful.
> The 50-minute timeout for free Travis accounts is currently a pain,
> especially when we'd like to run e2e tests in our own Travis. I had to
> manually split the jobs to make it possible for them to pass.
>
> Thanks,
> Zhu Zhu
>
> On Wed, Dec 4, 2019 at 6:36 PM Robert Metzger <rmetz...@apache.org> wrote:
>
> > Hi all,
> >
> > as a follow up from our discussion on reducing the build time [1], I would
> > like to propose migrating our build infrastructure to Azure Pipelines (away
> > from Travis).
> >
> > I believe that we have reached the limits of what Travis can provide the
> > Flink community, and I don't want the build system to limit or influence
> > the project's growth.
> >
> > *Benefits:*
> > 1. The free Travis accounts are limited to 5 parallel builds, with a timeout
> > of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts*
> > for free for open source projects.
> > 2. Azure Pipelines allows us to *add custom build machines* to the pool of
> > 10 free parallel builders.
> > This will allow the Flink community to scale the available build capacity
> > as the project grows. We are dependent on donations from supporting
> > companies, but I believe that it is easier for companies to donate machines
> > than money.
> > Alibaba is willing to provide 10 machines, with 32 cores each, to the Flink
> > project for this purpose.
> > In addition, Xiyuan, who's working on adding ARM support for Flink, provided
> > me with 2 ARM machines (16 cores each).
> > I want to use the custom, more efficient build machines for building
> > Flink's pull requests and master-pushes.
> > 3. *Azure Pipelines is a more feature-rich tool*, allowing us, for example,
> > to transfer intermediate build artifacts between pipeline stages. This will
> > allow us to make the build more reliable (we are currently abusing the
> > caching mechanism in Travis for this).
> > It also has some basic analytics on test results / flaky tests etc.
> >
> > *Known problems:*
> > - Initially, we might see different build instabilities than before
> > - There's a higher maintenance overhead for the custom build machines
> > (keeping them up to date etc.)
> > - We cannot use the build status integration of AZP, because it requires
> > write access to the repository's source. The foundation does not allow that
> > [2].
> > I propose to extend flinkbot / the flink-ci repository.
> >
> > *Current Status:*
> > - I'm able [3] to execute [4] the current custom build scripts on Azure
> > Pipelines: This means that we will have one compile stage, and N testing
> > jobs in the 2nd stage. Currently, we have N=10 testing jobs.
> > The time from the start of a build until all tests have completed is
> > 1 hour 22 minutes.
> > - I'm working on getting the nightly end to end tests to run on the new
> > infrastructure.
> > - I'm working on getting the build to work on our pool of custom machines
> > as well
> > - I'm working on setting up the full matrix of builds (different scala,
> > hadoop etc. versions) for the nightlies
> >
> > *Next Steps:*
> > - I propose to document the entire build system in the Flink Wiki
> > - Once Azure can cover the same pull request tests as Travis, I would set
> > it up to run in parallel (including Flinkbot posting links to Azure). I
> > hope that this phase lasts for 1-2 weeks only, so that we do not have to
> > maintain things concurrently. I will monitor the build stability closely,
> > but would expect some support with debugging potential issues from the
> > contributors.
> > - Once there are no problems with the new setup, we remove the Travis
> > setup.
> > - Independently, I will work on triggering builds from master /
> > release-branch pushes, as well as cron builds from the master branch ... all
> > this will be described in the Wiki.
> >
> >
> > *Timeline:*
> > - Once I have the feeling that people are supportive of the idea, I will
> > start documenting in the Wiki. The first pull requests should show up after
> > a few more days.
> > I will take one month of parental leave starting some time later in
> > December, which will probably delay things a bit. I hope to have everything
> > finished by the end of January.
> >
> > I'm happy to hear your thoughts on this work.
> > If nobody objects, I will start documenting the system and prepare
> > everything for the migration.
> >
> > Best,
> > Robert
> >
> >
> >
> > [1]
> > https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
> > [2] https://issues.apache.org/jira/browse/INFRA-17030
> > [3] https://github.com/rmetzger/flink/tree/azure_playground
> > [4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
> >
>
