FYI: I have moved the Flink PR and master builds from my personal Azure account to a PMC-controlled account: https://dev.azure.com/apache-flink/apache-flink/_build
On Fri, Apr 17, 2020 at 8:28 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thanks a lot for bringing up this topic again.
> The reason why I was hesitant to decommission Travis was that we were
> still facing some issues with the Azure infrastructure that I want to
> resolve, so that we have strong test coverage.
>
> In the last few weeks, we had the following issues:
> - unstable e2e tests (we are running the e2e tests much more frequently,
>   thus we see more failures (and discover actual bugs!))
> - network issues (mostly around downloading maven artifacts. This is
>   solved at the cost of slower builds. I'm preparing a fix to have stable &
>   fast maven downloads)
> - the private builds were never really stable (this is work in progress.
>   The situation is definitely better than the private Travis builds)
> - I haven't followed the overall master stability closely before February,
>   but I have the feeling that April so far was a pretty unstable month on
>   master. Piotr is regularly reverting commits that somehow broke master.
>   The problem with an unstable master is that it causes "CI fatigue", where
>   people assume that failing builds are not worth investigating anymore,
>   leading to more instability. This is not a problem of the CI
>   infrastructure itself, but it makes me less confident switching systems :)
>
> Unless something unexpected happens, I'm proposing to disable pull request
> processing on Travis next week.
>
> On Fri, Apr 17, 2020 at 10:05 AM Gary Yao <g...@apache.org> wrote:
>
>> I am in favour of decommissioning Travis.
>>
>> Moreover, I wanted to use this thread to raise another issue with Travis
>> that I have discovered recently; many of the builds running in my private
>> Travis account are timing out in the compilation stage (i.e., compilation
>> takes more than 50 minutes). This means that I am not able to reliably run
>> a full build on a CI server without creating a pull request.
>> If other developers also experience this issue, it would speak for putting
>> more effort into making Azure Pipelines the project-wide default.
>>
>> Best,
>> Gary
>>
>> On Thu, Mar 26, 2020 at 12:26 PM Yu Li <car...@gmail.com> wrote:
>>
>>> Thanks for the clarification Robert.
>>>
>>> Since the first step plan is to replace the travis PR runs, I checked all
>>> PR builds from 2020-01-01 (PR#10735-11526) [1], and below is the result:
>>>
>>> * Travis FAILURE: 298
>>> * Travis SUCCESS: 649 (68.5%)
>>> * Azure FAILURE: 420
>>> * Azure SUCCESS: 571 (57.6%)
>>>
>>> Since the patch for each run is equivalent for Travis and Azure, there
>>> seems to be a slightly higher failure rate (~10%) when running in Azure.
>>>
>>> However, with the just-merged fix for uploading logs (FLINK-16480), I
>>> believe the success rate of Azure could compete with Travis now (uploading
>>> files contributes to 20% of the failures according to the report [2]).
>>>
>>> So I'm +1 to disable travis runs according to the numbers.
>>>
>>> Best Regards,
>>> Yu
>>>
>>> [1] https://github.com/apache/flink/pulls?q=is%3Apr+created%3A%3E%3D2020-01-01
>>> [2] https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=4
>>>
>>> On Thu, 26 Mar 2020 at 03:28, Robert Metzger <rmetz...@apache.org> wrote:
>>>
>>>> Thank you for your responses.
>>>>
>>>> @Yu Li: In the current master, the log upload always fails if the e2e job
>>>> failed. I just merged a PR that fixes this issue [1]. The problem was not
>>>> really the network stability, rather a problem with the interaction of the
>>>> jobs in the pipeline (the e2e job did not set the right variables for the
>>>> log upload).
>>>> Secondly, you are looking at the report of the "flink-ci.flink" pipeline,
>>>> where pull requests are built.
>>>> Naturally, pull request builds fail all the time, because the PRs are not
>>>> yet perfect.
>>>>
>>>> "flink-ci.flink-master" is the right pipeline to look at:
>>>> https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=8&contextType=build
>>>> We have a fairly high number of failures there, because we currently have
>>>> some issues downloading the maven artifacts [2]. I'm working already with
>>>> Chesnay on merging a fix for that.
>>>>
>>>> [1] https://github.com/apache/flink/commit/1c86b8b9dd05615a3b2600984db738a9bf388259
>>>> [2] https://issues.apache.org/jira/browse/FLINK-16720
>>>>
>>>> On Wed, Mar 25, 2020 at 4:48 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>
>>>>> The easiest way to disable travis for pushes is to remove all builds
>>>>> from the .travis.yml with a push/pr condition.
>>>>>
>>>>> On 25/03/2020 15:03, Robert Metzger wrote:
>>>>>> Thank you for the feedback so far.
>>>>>>
>>>>>> Responses to the items Chesnay raised:
>>>>>>
>>>>>>> - by virtue of maintaining the past 2 releases we will have to maintain
>>>>>>>   any Travis infrastructure as long as 1.10 is supported, i.e., until 1.12
>>>>>>
>>>>>> Okay. I wasn't sure about the exact policy there.
>>>>>>
>>>>>>> - the azure setup doesn't appear to be equivalent yet since the java e2e
>>>>>>>   profile isn't setting the hadoop switch (-Pe2e-hadoop), as a result of
>>>>>>>   which SQLClientKafkaITCase isn't run
>>>>>>
>>>>>> I filed a ticket to address this:
>>>>>> https://issues.apache.org/jira/browse/FLINK-16778
>>>>>>
>>>>>>> - the nightly scripts still seem to be using a maven version other than
>>>>>>>   3.2.5; from today on master:
>>>>>>> 2020-03-25T05:31:52.7412964Z [INFO] --------< org.apache.flink:flink-end-to-end-tests-common-kafka >--------
>>>>>>> 2020-03-25T05:31:52.7413854Z [INFO] Building flink-end-to-end-tests-common-kafka 1.11-SNAPSHOT [39/46]
>>>>>>> 2020-03-25T05:31:52.7414689Z [INFO] --------------------------------[ jar ]---------------------------------
>>>>>>> 2020-03-25T05:31:52.7518360Z [INFO]
>>>>>>> 2020-03-25T05:31:52.7519770Z [INFO] --- maven-checkstyle-plugin:2.17:check (validate) @ flink-end-to-end-tests-common-kafka ---
>>>>>>
>>>>>> I'm planning to address this as part of
>>>>>> https://issues.apache.org/jira/browse/FLINK-16411, where I work on
>>>>>> centralizing all mvn invocations.
>>>>>>
>>>>>>> - there is no real benefit in retiring the travis support in CiBot; the
>>>>>>>   important part is whether Travis is run or not for pull requests.
>>>>>>> From what I can tell though azure seems to be working fine for pull
>>>>>>> requests, so +1 to at least disable the travis PR runs.
>>>>>>
>>>>>> So we disable Travis for https://github.com/flink-ci/flink ? I will do it
>>>>>> once there are no new concerns and the above tickets are resolved.
>>>>>>
>>>>>> What about disabling travis for master pushes (e.g. removing the
>>>>>> .travis.yml file from master)?
>>>>>>
>>>>>> @Dian:
>>>>>> Thanks a lot for your feedback.
>>>>>>
>>>>>>> - The report of Azure is still not viewable[1] (I noticed that Hequn has
>>>>>>>   also reported this issue in another thread). This is very useful
>>>>>>>   information.
>>>>>>
>>>>>> You are referring to the emails sent to builds@f.a.o, right?
>>>>>> I have reported this both as a bug [1] and a feature request [2] to Azure.
>>>>>> But I don't believe they will resolve this issue anytime soon.
>>>>>> Azure has a notifications API that we could use to build a service that
>>>>>> sends emails to that list, but I feel that this is really a waste of time.
>>>>>> The URL in the link even contains the ID of the build. We would just need
>>>>>> to extract this ID and generate the appropriate URL. I will try to directly
>>>>>> reach the product management of AZP, maybe I can get some attention from
>>>>>> there.
>>>>>>
>>>>>> [1] https://developercommunity.visualstudio.com/content/problem/957778/third-parties-are-unable-to-access-notification-li.html?childToView=960403#comment-960403
>>>>>> [2] https://developercommunity.visualstudio.com/content/idea/960472/third-parties-are-unable-to-access-notification-li-1.html
>>>>>>
>>>>>> On Wed, Mar 25, 2020 at 10:34 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>
>>>>>>> It was left out since it adds significant additional complexity and the
>>>>>>> value is dubious at best for PRs that aren't merged shortly after the
>>>>>>> build has finished.
>>>>>>>
>>>>>>> On 25/03/2020 10:28, Dian Fu wrote:
>>>>>>>> Thanks for the information.
>>>>>>>> I'm sorry that I wasn't aware of this before; I have checked the build
>>>>>>>> log of travis and confirmed that this is true.
>>>>>>>>
>>>>>>>> @Chesnay Are there any specific reasons for this and is it possible to
>>>>>>>> add this back for Azure Pipelines?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dian
>>>>>>>>
>>>>>>>>> On March 25, 2020, at 4:43 PM, Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> @Dian we haven't been rebasing PR's against master for months, ever
>>>>>>>>> since we switched to CiBot.
>>>>>>>>>
>>>>>>>>> On 25/03/2020 09:29, Dian Fu wrote:
>>>>>>>>>> Hi Robert,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your great work!
>>>>>>>>>>
>>>>>>>>>> Overall I'm +1 to switch to Azure as the primary CI tool if it's stable
>>>>>>>>>> enough, as I think there is no need to run both travis and Azure for
>>>>>>>>>> one single PR.
>>>>>>>>>>
>>>>>>>>>> However, there are still some improvements needed and it would be
>>>>>>>>>> great if these issues could be addressed before fully switching to Azure:
>>>>>>>>>> - The report of Azure is still not viewable[1] (I noticed that Hequn
>>>>>>>>>>   has also reported this issue in another thread). This is very useful
>>>>>>>>>>   information.
>>>>>>>>>> - For the PR tests on Azure Pipelines, it seems that the PR is not
>>>>>>>>>>   rebased onto master before running the tests.
>>>>>>>>>> Thanks,
>>>>>>>>>> Dian
>>>>>>>>>>
>>>>>>>>>> [1] https://dev.azure.com/rmetzger/web/build.aspx?pcguid=03e2a4fd-f647-46c5-a324-527d2c2984ce&builduri=vstfs%3a%2f%2f%2fBuild%2fBuild%2f6593&tracking_data=eyJTb3VyY2UiOiJFbWFpbCIsIlR5cGUiOiJOb3RpZmljYXRpb24iLCJTSUQiOiIzMzk0MzciLCJTVHlwZSI6IkdSUCIsIlJlY2lwIjoxLCJfeGNpIjp7Ik5JRCI6NDAyODQ3NzksIk1SZWNpcCI6Im0wPTEgIiwiQWN0IjoiMTNjNDc3YWMtZTBjYS00MjJkLTkxOTItZWI0NzFkZmUzMWY0In0sIkVsZW1lbnQiOiJoZXJvL2N0YSJ9
>>>>>>>>>>
>>>>>>>>>>> On March 25, 2020, at 3:33 PM, Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Some thoughts:
>>>>>>>>>>> - by virtue of maintaining the past 2 releases we will have to
>>>>>>>>>>>   maintain any Travis infrastructure as long as 1.10 is supported,
>>>>>>>>>>>   i.e., until 1.12
>>>>>>>>>>> - the azure setup doesn't appear to be equivalent yet since the java
>>>>>>>>>>>   e2e profile isn't setting the hadoop switch (-Pe2e-hadoop), as a
>>>>>>>>>>>   result of which SQLClientKafkaITCase isn't run
>>>>>>>>>>> - the nightly scripts still seem to be using a maven version other
>>>>>>>>>>>   than 3.2.5; from today on master:
>>>>>>>>>>> 2020-03-25T05:31:52.7412964Z [INFO] --------< org.apache.flink:flink-end-to-end-tests-common-kafka >--------
>>>>>>>>>>> 2020-03-25T05:31:52.7413854Z [INFO] Building flink-end-to-end-tests-common-kafka 1.11-SNAPSHOT [39/46]
>>>>>>>>>>> 2020-03-25T05:31:52.7414689Z [INFO] --------------------------------[ jar ]---------------------------------
>>>>>>>>>>> 2020-03-25T05:31:52.7518360Z [INFO]
>>>>>>>>>>> 2020-03-25T05:31:52.7519770Z [INFO] --- maven-checkstyle-plugin:2.17:check (validate) @ flink-end-to-end-tests-common-kafka ---
>>>>>>>>>>> - there is no real benefit in retiring the travis support in CiBot;
>>>>>>>>>>>   the important part is whether Travis is run or not for pull requests.
>>>>>>>>>>> From what I can tell though azure seems to be working fine for pull
>>>>>>>>>>> requests, so +1 to at least disable the travis PR runs.
>>>>>>>>>>>
>>>>>>>>>>> On 23/03/2020 14:48, Robert Metzger wrote:
>>>>>>>>>>>> Hey devs,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to discuss whether it makes sense to fully switch to Azure
>>>>>>>>>>>> Pipelines and phase out our Travis integration.
>>>>>>>>>>>> More information on our Azure integration can be found here:
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/2020/03/22/Migrating+Flink%27s+CI+Infrastructure+from+Travis+CI+to+Azure+Pipelines
>>>>>>>>>>>>
>>>>>>>>>>>> Travis will stay for the release-1.10 and older branches, as I have set up
>>>>>>>>>>>> Azure only for the master branch.
>>>>>>>>>>>>
>>>>>>>>>>>> Proposal:
>>>>>>>>>>>> - We keep the flinkbot infrastructure supporting both Travis and Azure
>>>>>>>>>>>>   around, while we are still receiving pull requests and pushes for the
>>>>>>>>>>>>   "master" and "release-1.10" branches.
>>>>>>>>>>>> - We remove the travis-specific files from "master", so that builds are
>>>>>>>>>>>>   not triggered anymore
>>>>>>>>>>>> - once we receive no more builds at Travis (because 1.11 has been
>>>>>>>>>>>>   released), we remove the remaining travis-related infrastructure
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Robert
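
Regarding the notification links discussed in the thread ("The URL in the link even contains the ID of the build. We would just need to extract this ID and generate the appropriate URL"): below is a minimal sketch of that idea in Python. It is only an illustration under stated assumptions, not what anyone in the thread implemented. The function name `build_results_url` and the `project` parameter are hypothetical, and the target format `_build/results?buildId=...` is assumed to be the publicly viewable build page; the organization name is taken from the first path segment of the notification link.

```python
import re
from urllib.parse import urlparse, parse_qs


def build_results_url(notification_url: str, project: str) -> str:
    """Turn an Azure notification link into a direct build-results link.

    Hypothetical helper: the output URL format is an assumption, and
    `project` must be supplied by the caller because the notification
    link does not contain the project name.
    """
    parsed = urlparse(notification_url)

    # parse_qs percent-decodes values, so "vstfs%3a%2f%2f%2fBuild%2fBuild%2f6593"
    # and "vstfs:///Build/Build/6593" are handled the same way.
    query = parse_qs(parsed.query)
    build_uris = query.get("builduri")
    if not build_uris:
        raise ValueError("no 'builduri' parameter in notification link")

    # The numeric build ID is the last path segment of the vstfs URI.
    match = re.search(r"/Build/Build/(\d+)$", build_uris[0])
    if match is None:
        raise ValueError(f"unexpected builduri format: {build_uris[0]}")
    build_id = match.group(1)

    # Assumption: the organization is the first path segment of the link,
    # e.g. "/rmetzger/web/build.aspx" -> "rmetzger".
    organization = parsed.path.strip("/").split("/")[0]

    return (
        f"https://dev.azure.com/{organization}/{project}"
        f"/_build/results?buildId={build_id}"
    )


if __name__ == "__main__":
    # Shortened form of the link quoted in the thread (tracking_data omitted).
    link = (
        "https://dev.azure.com/rmetzger/web/build.aspx"
        "?pcguid=03e2a4fd-f647-46c5-a324-527d2c2984ce"
        "&builduri=vstfs%3a%2f%2f%2fBuild%2fBuild%2f6593"
    )
    print(build_results_url(link, project="Flink"))
```

The project name has to be passed in explicitly because the notification link only carries the organization and the build URI; "Flink" is used above because it appears in the pipeline URLs quoted in the thread.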