Hi all,

As several people know by now, we are planning to move from Azure CI to
GitHub Actions (GHA). This is motivated by (not an exhaustive list):
- Not needing to mirror the repo anymore for CI
- Improving the contributor experience, especially for new contributors
- GHA development being more active than Azure CI development

If you want to check out the current version of the planned GHA workflow,
you can find it here:
https://github.com/ververica/flink/blob/master/.github/workflows/hadoop-2.8.3-scala-2.12-workflow.yml
Past runs can be seen here: https://github.com/ververica/flink/actions (lots
of red, but that is almost never caused by the workflow itself)

I want to put a draft for the migration roadmap up for discussion. It's
divided into several phases:

*Phase 1:* GHA activated on master (but not required)
- A single CI machine is converted to run GHA runners (instead of Azure
runners) and runs the workflow on pushes to master
- Azure CI remains unchanged and is still the source of truth
- We can compare runtimes and behavior/failures
- Timeframe: 2 weeks
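
To make Phase 1 a bit more concrete, here is a minimal sketch of what the
trigger side could look like. This is not the actual workflow file (that one
is linked above); the workflow name, the `self-hosted` label and the build
command are placeholders I made up for illustration.

```yaml
# Hypothetical sketch, not the real Flink workflow.
name: GHA trial build

on:
  push:
    branches:
      - master            # Phase 1: only pushes to master trigger a run

jobs:
  build:
    # Assumes the converted CI machine is registered as a self-hosted runner.
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # Placeholder build command; the real steps live in the linked yml.
      - run: mvn -B clean verify
```

Since nothing is marked as required at this stage, a red run here would not
block anything; it only gives us data to compare against Azure.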

*Phase 2:* Additional features
- Any additional functionality that we want to add to GHA is added (e.g.
not running the workflow if workflow files were modified)
- Functionality from FlinkCIBot that we want to keep is ported over
(syncing with the mirror repo can be dropped, but there are some automated
checks that we want to keep)
- We can monitor whether performance is impacted by any change
- Timeframe: 2 weeks

*Phase 3:* Cron jobs and (some) PR triggers run on GHA
- GHA cron builds activated (for master and release branches)
    - Note: This includes some backports to all affected branches;
otherwise the workflows won’t run there:
https://stackoverflow.com/questions/61989951/github-action-workflow-not-running/61992817#61992817
- GHA builds run for PRs of select committers (the idea is to try out
builds for all the intended trigger conditions)
- Timeframe: 1 week
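
As a rough illustration of the Phase 3 triggers, the cron and "select
committers" conditions could be sketched as follows. The cron time, the
account names and the build command are invented placeholders, and the
allow-list approach is just one possible way to restrict PR builds, not
something we have settled on.

```yaml
# Hypothetical sketch, not the real Flink workflow.
on:
  schedule:
    # Nightly at 02:00 UTC; the time is an arbitrary placeholder.
    - cron: '0 2 * * *'
  pull_request:
    branches:
      - master

jobs:
  build:
    # Run for cron triggers, and for PRs only if the author is on the
    # allow-list. The account names below are made up.
    if: >-
      github.event_name == 'schedule' ||
      contains(fromJSON('["committer-a", "committer-b"]'),
               github.event.pull_request.user.login)
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # Placeholder build command.
      - run: mvn -B clean verify
```

Widening the allow-list step by step would let us exercise all intended
trigger conditions before Phase 4 opens GHA builds up to every PR.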

*Up to this point, the existing CI pipeline is mostly unaffected - we only
took away one CI machine.*

*Phase 4:* Full switch to GHA
- Set up GHA runners on all machines
- GHA builds are activated for all PRs
- Either a green Azure build or a green GHA build is required
- GHA runners are activated, Azure runners are deactivated (but not yet
removed) apart from 1 machine (for stragglers)
- Azure cron jobs are disabled, but kept around in case we need to revert
- Timeframe: 1-2 weeks

*Phase 5:* Removal of Azure CI leftovers
- Only after we are satisfied that GHA is stable (at least 1 month after
the switch, can be longer)
- Green GHA build is required from now on
- Stale PRs that don't have a GHA run will have to trigger a new one (but
they would most likely have to rebase anyway...)
- (old) FlinkCIBot is disabled
- Azure yamls are deleted
- Azure runners are removed from machines


Timing-wise, the full switch to GHA (Phase 4) should happen during a quiet
time, far away from a release. The other phases shouldn't have much impact,
but they still shouldn't land right before a release, of course.
Please give us your thoughts and point out anything we missed or that
doesn't seem to make sense!

Best,
Nico
