We could limit the (first) trial run to branches.

PRs wouldn't be affected, which avoids a bunch of concerns about possibly blocking PRs and misleading people into thinking that CI is green. We'd also have a better handle on how much capacity we are consuming, and contributors would still get the new setup (which for some is better than none).
We'd also side-step any potential security issue for the time being.
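
For illustration, a branch-only trigger might look roughly like the sketch below. This is only an assumption about how such a restriction could be expressed, not the workflow from the FLIP: the workflow name, the branch patterns, and the placeholder job are hypothetical.

```yaml
# Hypothetical sketch: start the trial workflow on branch pushes only.
# The name, branch patterns, and job below are placeholders, not taken from the FLIP.
name: "CI trial (branches only)"

on:
  push:
    branches:
      - master
      - "release-*"
  # No `pull_request` trigger is declared, so opening or updating a PR
  # does not start this workflow.

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Placeholder build step
        run: echo "build and test steps would go here"
```

Because only `push` is listed, PR events would not produce a status from this workflow, which matches the idea of keeping PRs out of the first trial.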

On 01/12/2023 05:10, Yangze Guo wrote:
Thanks for the efforts, @Matthias. +1 to start a trial on GitHub
Actions and migrate the CI if we can prove its computation capacity
and stability.

I share Xintong's concern that we do not explicitly state the effect
of this trial on the contribution procedure. I think you can elaborate
more on this in the migration plan section. Here are my thoughts on it:
I prefer to enable the CI workflow based on GitHub Actions for each PR
because it helps us understand its stability and performance under
load. However, I am not inclined to make "passing the CI via GitHub
Actions" a requirement in the code contribution process. Instead, we
can encourage contributors to report unstable cases under a dedicated
umbrella ticket when they encounter them.

Best,
Yangze Guo

On Thu, Nov 30, 2023 at 12:10 AM Matthias Pohl
<matthias.p...@aiven.io.invalid> wrote:
With regard to Alex's concerns on hardware disparity: I did a bit more
digging on that one. I added my findings in a hardware section to FLIP-396
[1]. It appears that the hardware is more or less the same between the
different hosts. Apache INFRA's runners have more disk space (1TB in
comparison to 14GB), though.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+during+Flink+1.19+Cycle+to+test+migrating+to+GitHub+Actions#FLIP396:TrialduringFlink1.19CycletotestmigratingtoGitHubActions-HardwareSpecifications

On Wed, Nov 29, 2023 at 4:01 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

Thanks for your feedback Alex. I responded to your comments below:

This is mentioned in the "Limitations of GitHub Actions in the past"
section of the FLIP. Does this also apply to the Apache INFRA setup or
can we expect contributors' runs executed there too?

Workflow runs on Flink forks (independent of PRs that would merge to
Apache Flink's core repo) will be executed with runners provided by GitHub
with their own limitations. Secrets are not set in these runs (similar to
what we have right now with PR runs).

If we allow the PR CI to run on Apache INFRA-hosted ephemeral runners we
might have the same freedom because of their ephemeral nature (the VMs are
discarded after use).

We only have to start thinking about self-hosted customized runners if we
decide/need to have dedicated VMs for Flink's CI (similar to what we have
right now with Azure CI and Alibaba's VMs). This might happen if the
waiting times for acquiring a runner are too long. In that case, we might
give a certain group of people (e.g. committers) or certain types of events
(for PRs, nightly builds, PR merges) the ability to use the self-hosted
runners.

As you mentioned in the FLIP, there are some timeout-related test
discrepancies between different setups. Similar discrepancies could
manifest themselves between the Github runners and the Apache INFRA
runners. It would be great if we could have a uniform setup, where if
tests pass in the individual CI, they also pass in the main runner and vice
versa.

I agree. So far, what we've seen is that the timeout instability comes
from overly optimistic timeout configurations in some tests (they
eventually also fail in Azure CI, but the GitHub-provided runners seem to
be more sensitive in this regard). Fixing the tests whenever such
flakiness is observed should bring us to a stage where the test behavior
matches between different runners.

We had a similar issue in the Azure CI setup: Certain tests were more
stable on the Alibaba machines than on Azure VMs. That is why we introduced
a dedicated stage for Azure CI VMs as part of the nightly runs (see
FLINK-18370 [1]). We could do the same for GitHub Actions if necessary.

Currently we have such memory limits-related issues in individual vs main
Azure CI pipelines.

I'm not sure I understand what you mean by memory limit-related issues.
The GitHub-provided runners do not seem to run into memory-related issues.
We have to see whether this also applies to Apache INFRA-provided runners.
My hope is that they have even better hardware than what GitHub offers. But
GitHub-provided runners seem to be a good fallback to rely on (see the
workflows I shared in my previous response to Xintong's message).

[1] https://issues.apache.org/jira/browse/FLINK-18370

On Wed, Nov 29, 2023 at 3:17 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

Thanks for your comments, Xintong. See my answers below.


I think it would be helpful if we can at the end migrate the CI to an
ASF-managed Github Action, as long as it provides us a similar
computation capacity and stability.

The current test runs in my Flink fork (using the GitHub-provided
runners) suggest that even with using generic GitHub runners we get decent
performance and stability. In this way I'm confident that we wouldn't lose
much.

Here's a comparison of the pipelines once more:
* Nightly workflow: GitHub Actions [1] vs Azure CI [2]
* PR workflow: GitHub Actions [3] vs Azure CI [4]

[1] https://github.com/XComp/flink/actions/workflows/flink-ci-extended.yml
[2] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=1&_a=summary
[3] https://github.com/XComp/flink/actions/workflows/flink-ci-basic.yml
[4] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2

Regarding the migration plan, I wonder if we should not disable the CIbot
until we fully decide to migrate to Github Actions? In case the nightly
runs don't really work well, it might be debatable whether we should
maintain the CI in two places (i.e. PRs on Github Actions and cron builds
on Azure).

The CIbot handles the PR CI. Disabling it would mean that users would
fully rely on the GitHub Actions workflow right away. I like the fact that
for PRs we actually have both. That makes it more obvious if CI is not on
par.
For the nightly builds, I'm not too worried because they are not exposed
to the contributors that much. That's more a question for the release
managers who are monitoring the nightly runs how they want to handle it.
But even there I see benefits of having both CIs running for some time to
see how much they differ from each other in terms of stability.

- What exactly are the changes that would affect contributors during the
trial period? Is it only an additional CI report that you can potentially
just ignore? Or would there be some larger impacts, e.g. you cannot merge a
PR if the Github Action CI is not passed (I don't know, I just made this
up)?

My plan would be to enable the PR CI workflow for PRs as well to have the
comparison. For contributors this would mean that they have an additional
CI point (essentially two CI runs for a PR). If that's not what we want, we
could disable it for PRs and only allow the basic CI run for pushes to
master.

On Wed, Nov 29, 2023 at 2:31 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

Thanks for driving this, Matthias! +1 for joining the INFRA trial.


Apache Infra did some experimenting on self-hosted runners in collaboration
with Apache Airflow (see ashb/runner with releases/pr-security-options
branch) where they only allow certain groups of users (e.g. committers) to
run their workflows on self-hosted machines. Any other group would have to
rely on GitHub’s runners.

This is mentioned in the "Limitations of GitHub Actions in the past"
section of the FLIP. Does this also apply to the Apache INFRA setup or can
we expect contributors' runs executed there too? As you mentioned in the
FLIP, there are some timeout-related test discrepancies between different
setups. Similar discrepancies could manifest themselves between the Github
runners and the Apache INFRA runners. It would be great if we could have a
uniform setup, where if tests pass in the individual CI, they also pass in
the main runner and vice versa. Currently we have such memory
limits-related issues in individual vs main Azure CI pipelines.

2. Disable Flink’s CI bot for PRs if step #1 is considered successful
3. Join trial program for ephemeral GHA runners

Due to potential new kinds of instabilities manifesting themselves in the
new setup, can we keep both CIs running in parallel and keep relying on the
existing one until we are confident in the test stability on the new
ephemeral GHA infra (skip 2.)?

Best,
Alex

On Wed, 29 Nov 2023 at 13:42, Xintong Song <tonysong...@gmail.com>
wrote:

Thanks for the efforts, Matthias.


I think it would be helpful if we can at the end migrate the CI to an
ASF-managed Github Action, as long as it provides us a similar computation
capacity and stability. Given that the proposal is only to start a trial
and investigate whether the migration is feasible, I don't see much concern
in this.


I have only one suggestion and one question.

- Regarding the migration plan, I wonder if we should not disable the CI
bot until we fully decide to migrate to Github Actions? In case the nightly
runs don't really work well, it might be debatable whether we should
maintain the CI in two places (i.e. PRs on Github Actions and cron builds
on Azure).

- What exactly are the changes that would affect contributors during the
trial period? Is it only an additional CI report that you can potentially
just ignore? Or would there be some larger impacts, e.g. you cannot merge a
PR if the Github Action CI is not passed (I don't know, I just made this
up)?


Best,

Xintong



On Wed, Nov 29, 2023 at 8:07 PM Yuxin Tan <tanyuxinw...@gmail.com>
wrote:
Ok, Thanks for the update and the explanations.

Best,
Yuxin


On Wed, Nov 29, 2023 at 15:43, Matthias Pohl
<matthias.p...@aiven.io.invalid> wrote:
According to the FLIP, the new tests will support the ARM env.
I believe that's good news for ARM users. I have a minor
question here. Will it be a blocker before migrating the new
tests? If not, when can we expect ARM environment
support to be implemented? Thanks.

Thanks for your feedback, everyone.

About the ARM support. I want to underline that this FLIP is not about
migrating to GitHub Actions but about starting a trial run in the Apache
Flink repository. That would allow us to make a proper decision on whether
GitHub Actions is what we want. I admit that the title is a bit
"clickbaity". I updated the FLIP's title and its Motivation to make things
clear.

The FLIP suggests starting a trial period until 1.19 is released to try
things out. A proper decision on whether we want to migrate would be made
at the end of the 1.19 release cycle.

About the ARM support: The related content of the FLIP is entirely based
on documentation from Apache INFRA's side. INFRA seems to offer this ARM
support for their ephemeral runners. The ephemeral runners are in the
testing stage, i.e. these runners are still experimental. Apache INFRA
asks Apache projects to join this test.

Whether the ARM support is actually possible to achieve within Flink is
something we have to figure out as part of the trial run. One conclusion
of the trial run could be that we still move ahead with GHA but don't use
arm machines due to some blocking issues.

Matthias



On Wed, Nov 29, 2023 at 4:46 AM Yuxin Tan <tanyuxinw...@gmail.com>
wrote:
Hi, Matthias,

Thanks for driving this.
+1 from my side.

According to the FLIP, the new tests will support the ARM env.
I believe that's good news for ARM users. I have a minor
question here. Will it be a blocker before migrating the new
tests? If not, when can we expect ARM environment
support to be implemented? Thanks.

Best,
Yuxin


On Wed, Nov 29, 2023 at 03:09, Márton Balassi <balassi.mar...@gmail.com>
wrote:
Thanks, Matthias. Big +1 from me.

On Tue, Nov 28, 2023 at 5:30 PM Matthias Pohl
<matthias.p...@aiven.io.invalid> wrote:

Thanks for the pointer. I'm planning to join that meeting.

On Tue, Nov 28, 2023 at 4:16 PM Etienne Chauchot <echauc...@apache.org>
wrote:

Hi all,

FYI there is the ASF infra roundtable soon. One of the subjects for this
session is GitHub Actions. It could be worth passing by:

December 6th, 2023 at 1700 UTC on the #Roundtable channel on Slack.
For information about the roundtables, and about how to join, see:
https://infra.apache.org/roundtable.html

Best

Etienne

On 24/11/2023 at 14:16, Maximilian Michels wrote:
Thanks for reviving the efforts here Matthias! +1 for the transition
to GitHub Actions.

As for ASF Infra Jenkins, it works fine. Jenkins is extremely
feature-rich. Not sure about the spare capacity though. I know that
for Apache Beam, Google donated a bunch of servers to get additional
build capacity.

-Max


On Thu, Nov 23, 2023 at 10:30 AM Matthias Pohl
<matthias.p...@aiven.io.invalid>  wrote:
Btw. even though we've been focusing on GitHub Actions with this FLIP, I'm
curious whether somebody has experience with Apache Infra's Jenkins
deployment. The discussion I found about Jenkins [1] is quite outdated
(2014). I haven't worked with it myself but could imagine that there are
some features provided through plugins which are missing in GitHub Actions.

[1] https://lists.apache.org/thread/vs81xdhn3q777r7x9k7wd4dyl9kvoqn4

On Tue, Nov 21, 2023 at 4:19 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

That's a valid point. I updated the FLIP accordingly:

Currently, the secrets (e.g. for S3 access tokens) are maintained by
certain PMC members with access to the corresponding configuration in the
Azure CI project. This responsibility will be moved to Apache Infra. They
are in charge of handling secrets in the Apache organization. As a
consequence, updating secrets is becoming a bit more complicated. This can
still be considered an improvement from a legal standpoint because the
responsibility is transferred from an individual company (i.e. Ververica
who's the maintainer of the Azure CI project) to the Apache Foundation.

On Tue, Nov 21, 2023 at 3:37 PM Martijn Visser <martijnvis...@apache.org>
wrote:

Hi Matthias,

Thanks for the write-up and for the efforts on this. I really hope
that we can move away from Azure towards GHA for a better integration
as well (directly seeing if a PR can be merged due to CI passing, for
example).

The one thing I'm missing in the FLIP is how we would set up the
secrets for the nightly runs (for the S3 tests, potential tests with
external services etc). My guess is we need to provide the secrets to
ASF Infra and then we would be able to refer to them in a pipeline?

Best regards,

Martijn

On Tue, Nov 21, 2023 at 3:05 PM Matthias Pohl
<matthias.p...@aiven.io.invalid>  wrote:
I realized that I mixed up FLIP IDs. FLIP-395 is already reserved [1]. I
switched to FLIP-396 [2] for the sake of consistency. 8)

[1] https://lists.apache.org/thread/wjd3nbvg6nt93lb0sd52f0lzls6559tv
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Migration+to+GitHub+Actions

On Tue, Nov 21, 2023 at 2:58 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

Hi everyone,

The Flink community discussed migrating from Azure CI to GitHub Actions
quite some time ago [1]. The efforts around that stalled due to limitations
around self-hosted runner support from Apache Infra’s side. There were some
recent developments on that topic. Apache Infra is experimenting with
ephemeral runners now which might enable us to move ahead with GitHub
Actions.

The goal is to join the trial phase for ephemeral runners and experiment
with our CI workflows in terms of stability and performance. At the end we
can decide whether we want to abandon Azure CI and move to GitHub Actions
or stick to the former one.

Nico Weidner and Chesnay laid the groundwork on this topic in the past. I
picked up the work they did and continued experimenting with it in my own
fork XComp/flink [2] over the past few weeks. The workflows are in a state
where I think that we can start moving the relevant code into Flink’s
repository. Example runs for the basic workflow [3] and the extended
(nightly) workflow [4] are provided.

This will bring a few more changes to the Flink contributors. That is why
I wanted to bring this discussion to the mailing list first. I did a
write-up on (hopefully) all related topics in FLIP-395 [5].

I’m looking forward to your feedback.

Matthias

[1] https://lists.apache.org/thread/vcyx2nx0mhklqwm827vgykv8pc54gg3k
[2] https://github.com/XComp/flink/actions
[3] https://github.com/XComp/flink/actions/runs/6926309782
[4] https://github.com/XComp/flink/actions/runs/6927443941
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Migration+to+GitHub+Actions

