Thanks, Lari, for the detailed explanation. This is something of an emergency situation, and I believe your plan is the way to go now.
I already prepared a pull request for moving the flaky test suite out of the Pulsar CI workflow: https://github.com/nicoloboschi/pulsar/pull/8
I can take care of executing the plan.

> 1. Cancel all existing builds in_progress or queued

I have a script locally that uses the GitHub Actions API to check and cancel the pending runs. We can extend it to cover all the queued builds (I will share it soon; a rough sketch of the approach is further down in this email).

> 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> 3. Wait for build to run for .asf.yaml change, merge it

After the pull request is out, we'll need to cancel all the other workflows that contributors may have inadvertently triggered.

> 4. Disable all workflows
> 5. Process specific PRs manually to improve the situation.
>    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
>    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.

In this step we will merge https://github.com/nicoloboschi/pulsar/pull/8
I also want to add to the list this improvement, which reduces runner usage for doc-only or C++ changes: https://github.com/nicoloboschi/pulsar/pull/7

> 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up

It's better to leave this task to the author of each pull request, so that we don't create too much load at the same time.

> 7. Enable workflows
> 8. Start processing PRs with checks to see if things are handled in a better way.
> 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> 10. Fix quarantined flaky tests

Nicolò Boschi

On Thu, Sep 8, 2022 at 09:27 Lari Hotari <lhot...@apache.org> wrote:

> If my assumption of a GitHub usage metrics bug in the GitHub Actions build job queue fairness algorithm is correct, what would help is running the flaky unit test group outside of the Pulsar CI workflow. In that case, the impact of the usage metrics would be limited.
>
> The example of https://github.com/apache/pulsar/actions/runs/3003787409/usage shows this flaw, as explained in the previous email. The total reported execution time in that report is 1d 1h 40m 21s of usage, while the actual usage is about 1/3 of this.
>
> When we move the most commonly failing job out of the Pulsar CI workflow, the impact of the possible usage metrics bug would be much smaller. I hope GitHub support responds to my issue and queries about this bug. It might take up to 7 days to get a reply, and technical questions can take even longer. In the meantime we need a solution for getting over this CI slowness issue.
>
> -Lari
>
> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > My current assumption about the CI slowness problem is that the usage metrics for Apache Pulsar builds are computed incorrectly on GitHub's side, and that this is resulting in apache/pulsar builds getting throttled. This assumption might be wrong, but it's the best guess at the moment.
> >
> > The fact that supports this assumption is that when re-running failed jobs in a workflow, the execution times for the previously successful jobs get counted as if they had all run.
> > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > The reported total usage is about 3x the actual usage.
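Coming back to step 1 of the plan above: a rough sketch of such a cancel script follows. This is only an illustration (not the exact script I have locally); it assumes the GitHub CLI (gh) is installed and authenticated with a token that is allowed to cancel workflow runs in apache/pulsar:

    #!/usr/bin/env bash
    # Illustrative sketch: cancel all queued and in-progress workflow runs.
    REPO="apache/pulsar"
    for status in queued in_progress; do
      # list the run ids for the given status via the GitHub Actions REST API
      gh api --paginate "repos/${REPO}/actions/runs?status=${status}" \
        --jq '.workflow_runs[].id' |
      while read -r run_id; do
        echo "Cancelling ${status} run ${run_id}"
        gh api -X POST "repos/${REPO}/actions/runs/${run_id}/cancel" >/dev/null
      done
    done

And for steps 2 and 9, the .asf.yaml change would roughly mean editing the branch protection section, along the lines of the sketch below (the structure follows the .asf.yaml branch protection settings; the context name shown is only a placeholder, not necessarily what apache/pulsar lists today):

    github:
      protected_branches:
        master:
          required_status_checks:
            # dropping the context entries (or this whole required_status_checks
            # block) removes the "required checks" gate so PRs can still be
            # merged while CI is unreliable; re-add them in step 9
            contexts:
              - Pulsar CI checks completed   # placeholder name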
> > The assumption that I have is that the "fairness algorithm" that GitHub uses to give all Apache projects roughly the same amount of GitHub Actions resources would take this flawed usage as the basis of its decisions and decide to throttle apache/pulsar builds.
> >
> > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail, and we have been re-running a lot of builds.
> >
> > The other fact supporting the theory of flawed usage metrics being used in the fairness algorithm is that other Apache projects aren't reporting issues about GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments on INFRA-23633 [1]:
> > > Unlike the case 2 years ago, the problem is not affecting all projects. In Apache Airflow we do not see any particular slow-down with Public Runners at this moment (just checked - everything is "as usual"). So I'd say it is something specific to Pulsar, not to "ASF" as a whole.
> >
> > There are also other comments from Jarek about the GitHub "fairness algorithm" (comment [2], other comment [3]):
> > > But I believe the current problem is different - it might be (looking at your jobs) simply a bug in GA that you hit, or indeed your demands are simply too high.
> >
> > I have opened tickets (2 tickets: 2 days ago and yesterday) with support.github.com and there hasn't been any response to them. It might take up to 7 days to get a response. We cannot rely on GitHub Support resolving this issue.
> >
> > I propose that we go ahead with the previously suggested action plan:
> > > One possible way forward:
> > > 1. Cancel all existing builds in_progress or queued
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> > >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> >
> > To clarify, steps 1-6 would optimally be done in 1 day, and we would stop processing ordinary PRs during that time. We would only handle PRs that fix the CI situation during this exceptional period.
> >
> > -Lari
> >
> > Links to Jarek's comments:
> > [1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > [2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > [3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >
> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > One possible way forward:
> > > 1. Cancel all existing builds in_progress or queued
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> > >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> > >
> > > -Lari
> > >
> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > The problem with CI is becoming worse. The build queue is 235 jobs now, and the queue time is over 7 hours.
> > > >
> > > > We will need to start shedding load in the build queue and get some fixes in.
> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some activities. I have created 2 GitHub Support tickets, but it usually takes up to a week to get a response.
> > > >
> > > > I have some assumptions about the issue, but they are just assumptions.
> > > > One oddity is that when "re-run failed jobs" is used in a large workflow, the execution times for previously successful jobs get counted as if they had run.
> > > > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > The reported usage is about 3x the actual usage.
> > > > The assumption that I have is that the "fairness algorithm" that GitHub uses to give all Apache projects roughly the same amount of GitHub Actions resources would take this flawed usage as the basis of its decisions.
> > > > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail, and we are re-running a lot of builds.
> > > >
> > > > Another problem is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before, when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with a manual admin operation behind the scenes.
> > > >
> > > > I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process, so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.
> > > >
> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
> > > >
> > > > -Lari
> > > >
> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > We are going to need to take actions to fix our problems.
> > > > > See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > >
> > > > > Jarek has done a large amount of GitHub Actions work with Apache Airflow, and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > >
> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> > > > > >
> > > > > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> I asked Gavin McDonald for an update on the Apache org GitHub Actions usage stats on the-asf Slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> > > > > >>
> > > > > >> I hope we get this issue resolved, since it delays PR processing a lot.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress right now.
> > > > > >>>
> > > > > >>> I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous emails in this thread.
> > > > > >>>
> > > > > >>> -Lari
> > > > > >>>
> > > > > >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >>>
> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > > > >>>>
> > > > > >>>> I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous email in this thread.
> > > > > >>>>
> > > > > >>>> Some updates:
> > > > > >>>>
> > > > > >>>> There has been a discussion with Gavin McDonald from ASF Infra on the-asf Slack about getting usage reports from GitHub to support the investigation. The Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in the GitHub UI, but it produced invalid results.
> > > > > >>>>
> > > > > >>>> I made a change to mitigate a source of additional GitHub Actions overhead. In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs.
> > > > > >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > > > > >>>>
> > > > > >>>>     concurrency:
> > > > > >>>>       group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >>>>       cancel-in-progress: true
> > > > > >>>>
> > > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > > >>>> branch-2.10 change: https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >>>> branch-2.9 change: https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >>>> branch-2.8 change: https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >>>> branch-2.7: https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >>>>
> > > > > >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > > > > >>>>
> > > > > >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build for the last commit will eventually run. The builds for the intermediate commits will be cancelled. Obviously there's a tradeoff here: we don't get the information about whether one of the earlier commits breaks the build. That's the cost we need to pay. Nevertheless, our build is so flaky that it's hard to determine whether a failed build result is caused only by a bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds; it's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours, which is a lot.
> > > > > >>>>
> > > > > >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue, possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > > > > >>>>
> > > > > >>>> BR,
> > > > > >>>>
> > > > > >>>> Lari
> > > > > >>>>
> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> GitHub Actions builds have been piling up in the build queue over the last few days.
> > > > > >>>>> I posted on bui...@apache.org ( https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s ) and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > > > >>>>> There's also a thread on the-asf Slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >>>>>
> > > > > >>>>> It seems that our build queue is finally getting picked up, but it would be great to see whether we hit a quota and whether that is the cause of the pauses.
> > > > > >>>>>
> > > > > >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > > >>>>>
> > > > > >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > > > >>>>>
> > > > > >>>>> I'd like to point out that a good way to get build feedback before sending a PR is to run builds on your personal GitHub Actions CI.
> > > > > >>>>> The benefit of this is that it doesn't consume the shared quota, and builds usually start instantly.
> > > > > >>>>> There are instructions about this in the contributors' guide: https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > > > >>>>>
> > > > > >>>>> BR,
> > > > > >>>>>
> > > > > >>>>> Lari
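PS: for anyone who hasn't used the "personal CI" approach Lari describes above, the flow is roughly the following. This is just a sketch: it assumes you have a fork of apache/pulsar with GitHub Actions enabled, a git remote named "fork" pointing at it, and an authenticated GitHub CLI (gh); adjust the placeholder names to your own setup.

    # push your work-in-progress branch to your own fork
    git push fork my-feature-branch

    # open a PR against your fork's own master branch (not apache/pulsar),
    # so the builds run on your fork's GitHub Actions quota instead of the
    # shared apache/pulsar queue
    gh pr create --repo <your-github-id>/pulsar \
      --base master --head my-feature-branch --fill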