With regard to Alex's concerns about hardware disparity: I did a bit more digging on that one and added my findings in a hardware section of FLIP-396 [1]. It appears that the hardware is more or less the same across the different hosts. Apache INFRA's runners have more disk space, though (1TB compared to 14GB).

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+during+Flink+1.19+Cycle+to+test+migrating+to+GitHub+Actions#FLIP396:TrialduringFlink1.19CycletotestmigratingtoGitHubActions-HardwareSpecifications

On Wed, Nov 29, 2023 at 4:01 PM Matthias Pohl <matthias.p...@aiven.io> wrote:

> Thanks for your feedback Alex. I responded to your comments below:
>
>> This is mentioned in the "Limitations of GitHub Actions in the past"
>> section of the FLIP. Does this also apply to the Apache INFRA setup or
>> can we expect contributors' runs executed there too?
>
> Workflow runs on Flink forks (independent of PRs that would merge to
> Apache Flink's core repo) will be executed with runners provided by GitHub
> with their own limitations. Secrets are not set in these runs (similar to
> what we have right now with PR runs).
>
> If we allow the PR CI to run on Apache INFRA-hosted ephemeral runners we
> might have the same freedom because of their ephemeral nature (the VMs are
> discarded leaving).
>
> We only have to start thinking about self-hosted customized runners if we
> decide/need to have dedicated VMs for Flink's CI (similar to what we have
> right now with Azure CI and Alibaba's VMs). This might happen if the
> waiting times for acquiring a runner are too long. In that case, we might
> give a certain group of people (e.g. committers) or certain types of events
> (for PRs, nightly builds, PR merges) the ability to use the self-hosted
> runners.
>
>> As you mentioned in the FLIP, there are some timeout-related test
>> discrepancies between different setups. Similar discrepancies could
>> manifest themselves between the Github runners and the Apache INFRA
>> runners. It would be great if we should have a uniform setup, where if
>> tests pass in the individual CI, they also pass in the main runner and vice
>> versa.
>
> I agree.
> So far, what we've seen is that the timeout instability is coming
> from too optimistic timeout configurations in some tests (they eventually
> also fail in Azure CI; but the GitHub-provided runners seem to be more
> sensitive in this regard). Fixing the tests if such a flakiness is observed
> should bring us to a stage where the test behavior is matching between
> different runners.
>
> We had a similar issue in the Azure CI setup: Certain tests were more
> stable on the Alibaba machines than on Azure VMs. That is why we introduced
> a dedicated stage for Azure CI VMs as part of the nightly runs (see
> FLINK-18370 [1]). We could do the same for GitHub Actions if necessary.
>
>> Currently we have such memory limits-related issues in individual vs main
>> Azure CI pipelines.
>
> I'm not sure I understand what you mean by memory limit-related issues.
> The GitHub-provided runners do not seem to run into memory-related issues.
> We have to see whether this also applies to Apache INFRA-provided runners.
> My hope is that they have even better hardware than what GitHub offers. But
> GitHub-provided runners seem to be a good fallback to rely on (see the
> workflows I shared in my previous response to Xintong's message).
>
> [1] https://issues.apache.org/jira/browse/FLINK-18370
>
> On Wed, Nov 29, 2023 at 3:17 PM Matthias Pohl <matthias.p...@aiven.io>
> wrote:
>
>> Thanks for your comments, Xintong. See my answers below.
>>
>>> I think it would be helpful if we can at the end migrate the CI to an
>>> ASF-managed Github Action, as long as it provides us a similar
>>> computation capacity and stability.
>>
>> The current test runs in my Flink fork (using the GitHub-provided
>> runners) suggest that even with using generic GitHub runners we get decent
>> performance and stability. In this way I'm confident that we wouldn't lose
>> much.
>>
>> Here's a comparison of the pipelines once more:
>> * Nightly workflow: GitHub Actions [1] vs Azure CI [2]
>> * PR workflow: GitHub Actions [3] vs Azure CI [4]
>>
>> [1] https://github.com/XComp/flink/actions/workflows/flink-ci-extended.yml
>> [2] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=1&_a=summary
>> [3] https://github.com/XComp/flink/actions/workflows/flink-ci-basic.yml
>> [4] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2
>>
>>> Regarding the migration plan, I wonder if we should not disable the CI bot
>>> until we fully decide to migrate to Github Actions? In case the nightly
>>> runs don't really work well, it might be debatable whether we should
>>> maintain the CI in two places (i.e. PRs on Github Actions and cron builds
>>> on Azure).
>>
>> The CI bot handles the PR CI. Disabling it would mean that users would
>> fully rely on the GitHub Actions workflow right away. I like the fact that
>> for PRs we actually have both. That makes it more obvious if CI is not on
>> par.
>> For the nightly builds, I'm not too worried because they are not exposed
>> to the contributors that much. That's more a question for the release
>> managers who are monitoring the nightly runs how they want to handle it.
>> But even there I see benefits of having both CIs running for some time to
>> see how much they differ from each other in terms of stability.
>>
>>> - What exactly are the changes that would affect contributors during the
>>> trial period? Is it only an additional CI report that you can potentially
>>> just ignore? Or would there be some larger impacts, e.g. you cannot merge a
>>> PR if the Github Action CI is not passed (I don't know, I just made this
>>> up)?
>>
>> My plan would be to enable the PR CI workflow for PRs as well to have the
>> comparison. For contributors this would mean that they have an additional
>> CI point (essentially two CI runs for a PR).
>> If that's not what we want, we
>> could disable it for PRs and only allow the basic CI run for pushes to
>> master.
>>
>> On Wed, Nov 29, 2023 at 2:31 PM Alexander Fedulov
>> <alexander.fedu...@gmail.com> wrote:
>>
>>> Thanks for driving this Matthias! +1 for joining the INFRA trial.
>>>
>>> > Apache Infra did some experimenting on self-hosted runners in collaboration
>>> > with Apache Airflow (see ashb/runner with releases/pr-security-options branch)
>>> > where they only allow certain groups of users (e.g. committers) to run their
>>> > workflows on self-hosted machines. Any other group would have to rely on
>>> > GitHub’s runners.
>>>
>>> This is mentioned in the "Limitations of GitHub Actions in the past"
>>> section of the FLIP. Does this also apply to the Apache INFRA setup or
>>> can we expect contributors' runs executed there too? As you mentioned in
>>> the FLIP, there are some timeout-related test discrepancies between
>>> different setups. Similar discrepancies could manifest themselves between
>>> the Github runners and the Apache INFRA runners. It would be great if we
>>> should have a uniform setup, where if tests pass in the individual CI,
>>> they also pass in the main runner and vice versa. Currently we have such
>>> memory limits-related issues in individual vs main Azure CI pipelines.
>>>
>>> > 2. Disable Flink’s CI bot for PRs if step #1 is considered successful
>>> > 3. Join trial program for ephemeral GHA runners
>>>
>>> Due to potential new kinds of instabilities manifesting themselves in the
>>> new setup, can we keep both CIs running in parallel and keep relying on
>>> the existing one until we are confident in the tests stability on the new
>>> ephemeral GHA infra (skip 2.)?
>>>
>>> Best,
>>> Alex
>>>
>>> On Wed, 29 Nov 2023 at 13:42, Xintong Song <tonysong...@gmail.com>
>>> wrote:
>>>
>>> > Thanks for the efforts, Matthias.
>>> >
>>> > I think it would be helpful if we can at the end migrate the CI to an
>>> > ASF-managed Github Action, as long as it provides us a similar computation
>>> > capacity and stability. Given that the proposal is only to start a trial
>>> > and investigate whether the migration is feasible, I don't see much concern
>>> > in this.
>>> >
>>> > I have only one suggestion and one question.
>>> >
>>> > - Regarding the migration plan, I wonder if we should not disable the CI
>>> > bot until we fully decide to migrate to Github Actions? In case the nightly
>>> > runs don't really work well, it might be debatable whether we should
>>> > maintain the CI in two places (i.e. PRs on Github Actions and cron builds
>>> > on Azure).
>>> >
>>> > - What exactly are the changes that would affect contributors during the
>>> > trial period? Is it only an additional CI report that you can potentially
>>> > just ignore? Or would there be some larger impacts, e.g. you cannot merge a
>>> > PR if the Github Action CI is not passed (I don't know, I just made this
>>> > up)?
>>> >
>>> > Best,
>>> >
>>> > Xintong
>>> >
>>> > On Wed, Nov 29, 2023 at 8:07 PM Yuxin Tan <tanyuxinw...@gmail.com> wrote:
>>> >
>>> > > Ok, Thanks for the update and the explanations.
>>> > >
>>> > > Best,
>>> > > Yuxin
>>> > >
>>> > > On Wed, Nov 29, 2023 at 15:43, Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
>>> > >
>>> > > > > According to the Flip, the new tests will support arm env.
>>> > > > > I believe that's good news for arm users. I have a minor
>>> > > > > question here. Will it be a blocker before migrating the new
>>> > > > > tests? If not, when can we expect arm environment
>>> > > > > support to be implemented? Thanks.
>>> > > >
>>> > > > Thanks for your feedback, everyone.
>>> > > >
>>> > > > About the ARM support.
>>> > > > I want to underline that this FLIP is not about
>>> > > > migrating to GitHub Actions but to start a trial run in the Apache Flink
>>> > > > repository. That would allow us to come up with a proper decision whether
>>> > > > GitHub Actions is what we want. I admit that the title is a bit
>>> > > > "clickbaity". I updated the FLIP's title and its Motivation to make things
>>> > > > clear.
>>> > > >
>>> > > > The FLIP suggests starting a trial period until 1.19 is released to try
>>> > > > things out. A proper decision on whether we want to migrate would be made
>>> > > > at the end of the 1.19 release cycle.
>>> > > >
>>> > > > About the ARM support: This related content of the FLIP is entirely based
>>> > > > on documentation from Apache INFRA's side. INFRA seems to offer this ARM
>>> > > > support for their ephemeral runners. The ephemeral runners are in the
>>> > > > testing stage, i.e. these runners are still experimental. Apache INFRA asks
>>> > > > Apache projects to join this test.
>>> > > >
>>> > > > Whether the ARM support is actually possible to achieve within Flink is
>>> > > > something we have to figure out as part of the trial run. One conclusion of
>>> > > > the trial run could be that we still move ahead with GHA but don't use arm
>>> > > > machines due to some blocking issues.
>>> > > >
>>> > > > Matthias
>>> > > >
>>> > > > On Wed, Nov 29, 2023 at 4:46 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote:
>>> > > >
>>> > > > > Hi, Matthias,
>>> > > > >
>>> > > > > Thanks for driving this.
>>> > > > > +1 from my side.
>>> > > > >
>>> > > > > According to the Flip, the new tests will support arm env.
>>> > > > > I believe that's good news for arm users. I have a minor
>>> > > > > question here. Will it be a blocker before migrating the new
>>> > > > > tests?
>>> > > > > If not, when can we expect arm environment
>>> > > > > support to be implemented? Thanks.
>>> > > > >
>>> > > > > Best,
>>> > > > > Yuxin
>>> > > > >
>>> > > > > On Wed, Nov 29, 2023 at 03:09, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>> > > > >
>>> > > > > > Thanks, Matthias. Big +1 from me.
>>> > > > > >
>>> > > > > > On Tue, Nov 28, 2023 at 5:30 PM Matthias Pohl
>>> > > > > > <matthias.p...@aiven.io.invalid> wrote:
>>> > > > > >
>>> > > > > > > Thanks for the pointer. I'm planning to join that meeting.
>>> > > > > > >
>>> > > > > > > On Tue, Nov 28, 2023 at 4:16 PM Etienne Chauchot <echauc...@apache.org>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > FYI there is the ASF infra roundtable soon. One of the subjects for this
>>> > > > > > > > session is GitHub Actions. It could be worth passing by:
>>> > > > > > > >
>>> > > > > > > > December 6th, 2023 at 1700 UTC on the #Roundtable channel on Slack.
>>> > > > > > > >
>>> > > > > > > > For information about the roundtables, and about how to join,
>>> > > > > > > > see: https://infra.apache.org/roundtable.html
>>> > > > > > > >
>>> > > > > > > > Best
>>> > > > > > > >
>>> > > > > > > > Etienne
>>> > > > > > > >
>>> > > > > > > > On 24/11/2023 at 14:16, Maximilian Michels wrote:
>>> > > > > > > > > Thanks for reviving the efforts here Matthias! +1 for the transition
>>> > > > > > > > > to GitHub Actions.
>>> > > > > > > > >
>>> > > > > > > > > As for ASF Infra Jenkins, it works fine. Jenkins is extremely
>>> > > > > > > > > feature-rich. Not sure about the spare capacity though. I know that
>>> > > > > > > > > for Apache Beam, Google donated a bunch of servers to get additional
>>> > > > > > > > > build capacity.
>>> > > > > > > > >
>>> > > > > > > > > -Max
>>> > > > > > > > >
>>> > > > > > > > > On Thu, Nov 23, 2023 at 10:30 AM Matthias Pohl
>>> > > > > > > > > <matthias.p...@aiven.io.invalid> wrote:
>>> > > > > > > > >> Btw. even though we've been focusing on GitHub Actions with this FLIP, I'm
>>> > > > > > > > >> curious whether somebody has experience with Apache Infra's Jenkins
>>> > > > > > > > >> deployment. The discussion I found about Jenkins [1] is quite out-dated
>>> > > > > > > > >> (2014). I haven't worked with it myself but could imagine that there are
>>> > > > > > > > >> some features provided through plugins which are missing in GitHub Actions.
>>> > > > > > > > >>
>>> > > > > > > > >> [1] https://lists.apache.org/thread/vs81xdhn3q777r7x9k7wd4dyl9kvoqn4
>>> > > > > > > > >>
>>> > > > > > > > >> On Tue, Nov 21, 2023 at 4:19 PM Matthias Pohl <matthias.p...@aiven.io>
>>> > > > > > > > >> wrote:
>>> > > > > > > > >>
>>> > > > > > > > >>> That's a valid point. I updated the FLIP accordingly:
>>> > > > > > > > >>>
>>> > > > > > > > >>>> Currently, the secrets (e.g. for S3 access tokens) are maintained by
>>> > > > > > > > >>>> certain PMC members with access to the corresponding configuration in the
>>> > > > > > > > >>>> Azure CI project. This responsibility will be moved to Apache Infra. They
>>> > > > > > > > >>>> are in charge of handling secrets in the Apache organization. As a
>>> > > > > > > > >>>> consequence, updating secrets is becoming a bit more complicated.
>>> > > > > > > > >>>> This can
>>> > > > > > > > >>>> be still considered an improvement from a legal standpoint because the
>>> > > > > > > > >>>> responsibility is transferred from an individual company (i.e. Ververica
>>> > > > > > > > >>>> who's the maintainer of the Azure CI project) to the Apache Foundation.
>>> > > > > > > > >>>
>>> > > > > > > > >>> On Tue, Nov 21, 2023 at 3:37 PM Martijn Visser <martijnvis...@apache.org>
>>> > > > > > > > >>> wrote:
>>> > > > > > > > >>>
>>> > > > > > > > >>>> Hi Matthias,
>>> > > > > > > > >>>>
>>> > > > > > > > >>>> Thanks for the write-up and for the efforts on this. I really hope
>>> > > > > > > > >>>> that we can move away from Azure towards GHA for a better integration
>>> > > > > > > > >>>> as well (directly seeing if a PR can be merged due to CI passing for
>>> > > > > > > > >>>> example).
>>> > > > > > > > >>>>
>>> > > > > > > > >>>> The one thing I'm missing in the FLIP is how we would setup the
>>> > > > > > > > >>>> secrets for the nightly runs (for the S3 tests, potential tests with
>>> > > > > > > > >>>> external services etc). My guess is we need to provide the secret to
>>> > > > > > > > >>>> ASF Infra and then we would be able to refer to them in a pipeline?
>>> > > > > > > > >>>>
>>> > > > > > > > >>>> Best regards,
>>> > > > > > > > >>>>
>>> > > > > > > > >>>> Martijn
>>> > > > > > > > >>>>
>>> > > > > > > > >>>> On Tue, Nov 21, 2023 at 3:05 PM Matthias Pohl
>>> > > > > > > > >>>> <matthias.p...@aiven.io.invalid> wrote:
>>> > > > > > > > >>>>> I realized that I mixed up FLIP IDs. FLIP-395 is already reserved [1]. I
>>> > > > > > > > >>>>> switched to FLIP-396 [2] for the sake of consistency.
>>> > > > > > > > >>>>> 8)
>>> > > > > > > > >>>>>
>>> > > > > > > > >>>>> [1] https://lists.apache.org/thread/wjd3nbvg6nt93lb0sd52f0lzls6559tv
>>> > > > > > > > >>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Migration+to+GitHub+Actions
>>> > > > > > > > >>>>>
>>> > > > > > > > >>>>> On Tue, Nov 21, 2023 at 2:58 PM Matthias Pohl <matthias.p...@aiven.io>
>>> > > > > > > > >>>>> wrote:
>>> > > > > > > > >>>>>
>>> > > > > > > > >>>>>> Hi everyone,
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> The Flink community discussed migrating from Azure CI to GitHub Actions
>>> > > > > > > > >>>>>> quite some time ago [1]. The efforts around that stalled due to limitations
>>> > > > > > > > >>>>>> around self-hosted runner support from Apache Infra’s side. There were some
>>> > > > > > > > >>>>>> recent developments on that topic. Apache Infra is experimenting with
>>> > > > > > > > >>>>>> ephemeral runners now which might enable us to move ahead with GitHub
>>> > > > > > > > >>>>>> Actions.
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> The goal is to join the trial phase for ephemeral runners and experiment
>>> > > > > > > > >>>>>> with our CI workflows in terms of stability and performance. At the end we
>>> > > > > > > > >>>>>> can decide whether we want to abandon Azure CI and move to GitHub Actions
>>> > > > > > > > >>>>>> or stick to the former one.
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> Nico Weidner and Chesnay laid the groundwork on this topic in the past. I
>>> > > > > > > > >>>>>> picked up the work they did and continued experimenting with it in my own
>>> > > > > > > > >>>>>> fork XComp/flink [2] the past few weeks. The workflows are in a state where
>>> > > > > > > > >>>>>> I think that we start moving the relevant code into Flink’s repository.
>>> > > > > > > > >>>>>> Example runs for the basic workflow [3] and the extended (nightly) workflow
>>> > > > > > > > >>>>>> [4] are provided.
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> This will bring a few more changes to the Flink contributors. That is why
>>> > > > > > > > >>>>>> I wanted to bring this discussion to the mailing list first. I did a write
>>> > > > > > > > >>>>>> up on (hopefully) all related topics in FLIP-395 [5].
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> I’m looking forward to your feedback.
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> Matthias
>>> > > > > > > > >>>>>>
>>> > > > > > > > >>>>>> [1] https://lists.apache.org/thread/vcyx2nx0mhklqwm827vgykv8pc54gg3k
>>> > > > > > > > >>>>>> [2] https://github.com/XComp/flink/actions
>>> > > > > > > > >>>>>> [3] https://github.com/XComp/flink/actions/runs/6926309782
>>> > > > > > > > >>>>>> [4] https://github.com/XComp/flink/actions/runs/6927443941
>>> > > > > > > > >>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Migration+to+GitHub+Actions