Thank you, Nicolò. There's lazy consensus; let's go forward with the action plan.

-Lari

On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> This is the pull for step 2: https://github.com/apache/pulsar/pull/17539
>
> This is the script I'm going to use to cancel pending workflows:
> https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
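>
> The gist of the script is roughly the following (a simplified sketch, not
> the exact code: it lists the queued workflow runs through the GitHub REST
> API and cancels them one by one; it assumes a GITHUB_TOKEN environment
> variable with repo scope and Node 18+ for the built-in fetch):
>
> // List queued workflow runs and cancel them one by one.
> const repo = "apache/pulsar";
> const headers = {
>   Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
>   Accept: "application/vnd.github+json",
> };
>
> async function cancelRuns(status) {
>   // GET /repos/{owner}/{repo}/actions/runs?status=... (first page only
>   // here; the real script would paginate).
>   const res = await fetch(
>     `https://api.github.com/repos/${repo}/actions/runs?status=${status}&per_page=100`,
>     { headers }
>   );
>   const { workflow_runs } = await res.json();
>   for (const run of workflow_runs) {
>     console.log(`Cancelling run ${run.id} (${run.name})`);
>     // POST /repos/{owner}/{repo}/actions/runs/{run_id}/cancel
>     await fetch(
>       `https://api.github.com/repos/${repo}/actions/runs/${run.id}/cancel`,
>       { method: "POST", headers }
>     );
>   }
> }
>
> cancelRuns("queued").catch(console.error);
>
> The same call with status set to "in_progress" covers the in-progress runs.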
>
> I'm going to run the script in minutes.
>
> I announced on Slack what is happening:
> https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
>
> > we're going to execute the plan described in the ML, so any queued
> > actions will be cancelled. To validate your pull, it is suggested to run
> > the actions in your own Pulsar fork. Please don't re-run failed jobs or
> > push any other commits, to avoid triggering new actions.
>
> Nicolò Boschi
>
> On Thu, Sep 8, 2022 at 09:42, Nicolò Boschi <boschi1...@gmail.com> wrote:
> > Thanks Lari for the detailed explanation. This is kind of an emergency
> > situation, and I believe your plan is the way to go now.
> >
> > I have already prepared a pull for moving the flaky suite out of the
> > Pulsar CI workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > I can take care of the execution of the plan.
> >
> > > 1. Cancel all existing builds in_progress or queued
> >
> > I have a script locally that uses the GHA API to check and cancel
> > pending runs. We can extend it to all the queued builds (I will share it
> > soon).
> >
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > merging PRs.
> > > 3. Wait for build to run for the .asf.yaml change, merge it
> >
> > After the pull is out, we'll need to cancel all other workflows that
> > contributors may have inadvertently triggered.
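> >
> > For reference, the .asf.yaml side of steps 2 and 3 would look roughly
> > like this (a sketch only; the actual file lists the real check names
> > under contexts):
> >
> > # .asf.yaml sketch: branch protection for master. Dropping the required
> > # checks means removing (or emptying) required_status_checks, so PRs can
> > # be merged while CI is not working properly.
> > github:
> >   protected_branches:
> >     master:
> >       required_status_checks:
> >         strict: false
> >         contexts: []   # previously: the list of required check names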
> >
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > > - Make GHA workflow improvements such as
> > > https://github.com/apache/pulsar/pull/17491 and
> > > https://github.com/apache/pulsar/pull/17490
> > > - Quarantine all very flaky tests so that everyone doesn't waste time
> > > with those. It should be possible to merge a PR even when a
> > > quarantined test fails.
> >
> > In this step we will merge https://github.com/nicoloboschi/pulsar/pull/8,
> > which moves the flaky suite into its own workflow, roughly sketched below.
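> >
> > The quarantined suite in its own workflow looks roughly like this (a
> > sketch of the idea, not the actual pull; the workflow name, test group,
> > and module are placeholders):
> >
> > # Sketch: flaky/quarantined tests in their own, non-required workflow.
> > name: Pulsar CI Flaky
> > on:
> >   pull_request:
> >     branches: [master]
> > jobs:
> >   flaky-tests:
> >     runs-on: ubuntu-latest
> >     # Don't mark the run as failed when the quarantined tests fail.
> >     continue-on-error: true
> >     steps:
> >       - uses: actions/checkout@v3
> >       - name: Run the quarantined test group
> >         run: mvn -B test -pl pulsar-broker -Dgroups='quarantine'
> >
> > Since this workflow is not in the list of required checks, a red result
> > here doesn't block merging the PR.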
> >
> > I want to add to the list this improvement, which reduces runner usage
> > for doc-only or C++ changes:
> > https://github.com/nicoloboschi/pulsar/pull/7
> >
> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> > > that changes are picked up
> >
> > It's better to leave this task to the author of the pull, in order not
> > to create too much load at the same time.
> >
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a
> > > better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml;
> > > in the meantime, be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> >
> > On Thu, Sep 8, 2022 at 09:27, Lari Hotari <lhot...@apache.org> wrote:
> >> If my assumption of a GitHub usage metrics bug in the GitHub Actions
> >> build job queue fairness algorithm is correct, what would help is
> >> running the flaky unit test group outside of the Pulsar CI workflow. In
> >> that case, the impact of the flawed usage metrics would be limited.
> >>
> >> The example of
> >> https://github.com/apache/pulsar/actions/runs/3003787409/usage shows
> >> this flaw, as explained in the previous email. The total reported
> >> execution time in that report is 1d 1h 40m 21s, while the actual usage
> >> is about 1/3 of this.
> >>
> >> When we move the most commonly failing job out of the Pulsar CI
> >> workflow, the impact of the possible usage metrics bug will be much
> >> smaller. I hope GitHub support responds to my issue and queries about
> >> this bug. It might take up to 7 days to get a reply, and longer for
> >> technical questions. In the meantime we need a solution for getting
> >> past this CI slowness issue.
> >>
> >> -Lari
> >>
> >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> >> > My current assumption about the CI slowness problem is that the usage
> >> > metrics for Apache Pulsar builds are computed incorrectly on GitHub's
> >> > side, and that this results in apache/pulsar builds getting
> >> > throttled. This assumption might be wrong, but it's the best guess at
> >> > the moment.
> >> >
> >> > The fact that supports this assumption is that when re-running failed
> >> > jobs in a workflow, the execution times of the previously successful
> >> > jobs get counted as if they had all run again. Here's an example:
> >> > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> >> > The reported total usage is about 3x the actual usage.
> >> >
> >> > My assumption is that the "fairness algorithm" that GitHub uses to
> >> > give all Apache projects about the same amount of GitHub Actions
> >> > resources takes this flawed usage as the basis of its decisions, and
> >> > it decides to throttle apache/pulsar builds.
> >> >
> >> > The reason why we are getting hit by this now is that a high number
> >> > of flaky test failures causes almost every build to fail, and we have
> >> > been re-running a lot of builds.
> >> >
> >> > The other fact supporting the theory of flawed usage metrics in the
> >> > fairness algorithm is that other Apache projects aren't reporting
> >> > GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments
> >> > on INFRA-23633 [1]:
> >> > > Unlike the case 2 years ago, the problem is not affecting all
> >> > > projects. In Apache Airflow we do not see any particular slow-down
> >> > > with Public Runners at this moment (just checked - everything is
> >> > > "as usual"). So I'd say it is something specific to Pulsar, not to
> >> > > "ASF" as a whole.
> >> >
> >> > There are also other comments from Jarek about the GitHub "fairness
> >> > algorithm" (comment [2], other comment [3]):
> >> > > But I believe the current problem is different - it might be
> >> > > (looking at your jobs) simply a bug in GA that you hit, or indeed
> >> > > your demands are simply too high.
> >> >
> >> > I have opened two tickets at support.github.com (one 2 days ago and
> >> > one yesterday), and there hasn't been any response yet. It might take
> >> > up to 7 days to get a response. We cannot rely on GitHub Support
> >> > resolving this issue.
> >> >
> >> > I propose that we go ahead with the previously suggested action plan:
> >> > > One possible way forward:
> >> > > 1. Cancel all existing builds in_progress or queued
> >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> >> > > merging PRs.
> >> > > 3. Wait for build to run for the .asf.yaml change, merge it
> >> > > 4. Disable all workflows
> >> > > 5. Process specific PRs manually to improve the situation.
> >> > > - Make GHA workflow improvements such as
> >> > > https://github.com/apache/pulsar/pull/17491 and
> >> > > https://github.com/apache/pulsar/pull/17490
> >> > > - Quarantine all very flaky tests so that everyone doesn't waste
> >> > > time with those. It should be possible to merge a PR even when a
> >> > > quarantined test fails.
> >> > > 6. Rebase PRs (or close and re-open) that would be processed next
> >> > > so that changes are picked up
> >> > > 7. Enable workflows
> >> > > 8. Start processing PRs with checks to see if things are handled in
> >> > > a better way.
> >> > > 9. When things are stable, enable required checks again in
> >> > > .asf.yaml, in the meantime be careful about merging PRs
> >> > > 10. Fix quarantined flaky tests
> >> >
> >> > To clarify: steps 1-6 would optimally be done in 1 day, and we would
> >> > stop processing ordinary PRs during this time. We would only handle
> >> > PRs that fix the CI situation during this exceptional period.
> >> >
> >> > -Lari
> >> >
> >> > Links to Jarek's comments:
> >> > [1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> >> > [2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >> > [3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >> >
> >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> >> > > One possible way forward:
> >> > > 1. Cancel all existing builds in_progress or queued
> >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> >> > > merging PRs.
> >> > > 3. Wait for build to run for the .asf.yaml change, merge it
> >> > > 4. Disable all workflows
> >> > > 5. Process specific PRs manually to improve the situation.
> >> > > - Make GHA workflow improvements such as
> >> > > https://github.com/apache/pulsar/pull/17491 and
> >> > > https://github.com/apache/pulsar/pull/17490
> >> > > - Quarantine all very flaky tests so that everyone doesn't waste
> >> > > time with those. It should be possible to merge a PR even when a
> >> > > quarantined test fails.
> >> > > 6. Rebase PRs (or close and re-open) that would be processed next
> >> > > so that changes are picked up
> >> > > 7. Enable workflows
> >> > > 8. Start processing PRs with checks to see if things are handled in
> >> > > a better way.
> >> > > 9. When things are stable, enable required checks again in
> >> > > .asf.yaml, in the meantime be careful about merging PRs
> >> > > 10. Fix quarantined flaky tests
> >> > >
> >> > > -Lari
> >> > >
> >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> >> > > > The problem with CI is becoming worse. The build queue is at 235
> >> > > > jobs now, and the queue time is over 7 hours.
> >> > > >
> >> > > > We will need to start shedding load in the build queue and get
> >> > > > some fixes in.
> >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
> >> > > > contain details about some activities. I have created 2 GitHub
> >> > > > Support tickets, but it usually takes up to a week to get a
> >> > > > response.
> >> > > >
> >> > > > I have some assumptions about the issue, but they are just
> >> > > > assumptions. One oddity is that when re-running failed jobs in a
> >> > > > large workflow, the execution times of the previously successful
> >> > > > jobs get counted as if they had run again. Here's an example:
> >> > > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> >> > > > The reported usage is about 3x the actual usage.
> >> > > > My assumption is that the "fairness algorithm" that GitHub uses
> >> > > > to give all Apache projects about the same amount of GitHub
> >> > > > Actions resources takes this flawed usage as the basis of its
> >> > > > decisions.
> >> > > > The reason why we are getting hit by this now is that a high
> >> > > > number of flaky test failures causes almost every build to fail,
> >> > > > and we are re-running a lot of builds.
> >> > > >
> >> > > > Another problem is that the GitHub Actions search doesn't always
> >> > > > show all workflow runs that are running. This has happened
> >> > > > before, when the GitHub Actions workflow search index was
> >> > > > corrupted. GitHub Support resolved that by rebuilding the search
> >> > > > index with a manual admin operation behind the scenes.
> >> > > >
> >> > > > I'm proposing that we start shedding load from CI by cancelling
> >> > > > build jobs and selecting which jobs to process, so that we get
> >> > > > the CI issue resolved. We might also have to disable required
> >> > > > checks so that we have some way to get changes merged while CI
> >> > > > doesn't work properly.
> >> > > >
> >> > > > I'm expecting lazy consensus on fixing CI unless someone
> >> > > > proposes a better plan. Let's keep everyone informed in this
> >> > > > mailing list thread.
> >> > > >
> >> > > > -Lari
> >> > > >
> >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> >> > > > > We are going to need to take actions to fix our problems. See
> >> > > > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> >> > > > >
> >> > > > > Jarek has done a large amount of GitHub Actions work with
> >> > > > > Apache Airflow, and his suggestions might be helpful. One of
> >> > > > > his suggestions was Apache Yetus. I think he means using the
> >> > > > > Maven plugin:
> >> > > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> >> > > > >
> >> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> >> > > > >
> >> > > > > > The Apache Infra ticket is
> >> > > > > > https://issues.apache.org/jira/browse/INFRA-23633 .
> >> > > > > >
> >> > > > > > -Lari
> >> > > > > >
> >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> >> > > > > >> I asked Gavin McDonald for an update on the Apache org
> >> > > > > >> GitHub Actions usage stats on the-asf Slack, in this thread:
> >> > > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> >> > > > > >>
> >> > > > > >> I hope we get this issue resolved, since it delays PR
> >> > > > > >> processing a lot.
> >> > > > > >>
> >> > > > > >> -Lari
> >> > > > > >>
> >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> >> > > > > >>> Pulsar CI continues to be congested, and the build queue
> >> > > > > >>> [1] is very long at the moment. There are 147 build jobs
> >> > > > > >>> in the queue and 16 jobs in progress.
> >> > > > > >>>
> >> > > > > >>> I would strongly advise everyone to use "personal CI" to
> >> > > > > >>> mitigate the long delay in CI feedback. You can simply
> >> > > > > >>> open a PR against your own personal fork of apache/pulsar
> >> > > > > >>> to run the builds in your "personal CI". There are more
> >> > > > > >>> details in the previous emails in this thread.
> >> > > > > >>>
> >> > > > > >>> -Lari
> >> > > > > >>>
> >> > > > > >>> [1] - build queue:
> >> > > > > >>> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> >> > > > > >>>
> >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> >> > > > > >>>> Pulsar CI continues to be congested, and the build queue
> >> > > > > >>>> is long.
> >> > > > > >>>>
> >> > > > > >>>> I would strongly advise everyone to use "personal CI" to
> >> > > > > >>>> mitigate the long delay in CI feedback. You can simply
> >> > > > > >>>> open a PR against your own personal fork of apache/pulsar
> >> > > > > >>>> to run the builds in your "personal CI". There are more
> >> > > > > >>>> details in the previous email in this thread.
> >> > > > > >>>>
> >> > > > > >>>> Some updates:
> >> > > > > >>>>
> >> > > > > >>>> There has been a discussion with Gavin McDonald from ASF
> >> > > > > >>>> Infra on the-asf Slack about getting usage reports from
> >> > > > > >>>> GitHub to support the investigation. The Slack thread is
> >> > > > > >>>> the same one mentioned in the previous email,
> >> > > > > >>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> >> > > > > >>>> Gavin already requested the usage report in the GitHub
> >> > > > > >>>> UI, but it produced invalid results.
> >> > > > > >>>>
> >> > > > > >>>> I made a change to mitigate a source of additional GitHub
> >> > > > > >>>> Actions overhead. In the past, each cherry-picked commit
> >> > > > > >>>> to a maintenance branch of Pulsar has triggered a lot of
> >> > > > > >>>> workflow runs.
> >> > > > > >>>>
> >> > > > > >>>> The solution for cancelling duplicate builds
> >> > > > > >>>> automatically is to add this definition to the workflow
> >> > > > > >>>> definition:
> >> > > > > >>>>
> >> > > > > >>>> concurrency:
> >> > > > > >>>>   group: ${{ github.workflow }}-${{ github.ref }}
> >> > > > > >>>>   cancel-in-progress: true
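> >> > > > > >>>>
> >> > > > > >>>> In a workflow file the block sits at the top level,
> >> > > > > >>>> roughly like this (a sketch, not one of the actual Pulsar
> >> > > > > >>>> workflows):
> >> > > > > >>>>
> >> > > > > >>>> # Sketch: top of a maintenance-branch workflow file.
> >> > > > > >>>> name: CI - Unit
> >> > > > > >>>> on:
> >> > > > > >>>>   push:
> >> > > > > >>>>     branches: [branch-2.10]
> >> > > > > >>>> concurrency:
> >> > > > > >>>>   # One group per workflow and branch: a new push to the
> >> > > > > >>>>   # same branch cancels the still-running build of the
> >> > > > > >>>>   # previous commit.
> >> > > > > >>>>   group: ${{ github.workflow }}-${{ github.ref }}
> >> > > > > >>>>   cancel-in-progress: true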
> >> > > > > >>>>
> >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> >> > > > > >>>> workflows:
> >> > > > > >>>>
> >> > > > > >>>> branch-2.10 change:
> >> > > > > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> >> > > > > >>>> branch-2.9 change:
> >> > > > > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> >> > > > > >>>> branch-2.8 change:
> >> > > > > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> >> > > > > >>>> branch-2.7 change:
> >> > > > > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> >> > > > > >>>>
> >> > > > > >>>> branch-2.11 already contains the necessary config for
> >> > > > > >>>> cancelling duplicate builds.
> >> > > > > >>>>
> >> > > > > >>>> The benefit of the above change is that when multiple
> >> > > > > >>>> commits are cherry-picked to a branch at once, only the
> >> > > > > >>>> build of the last commit will eventually run. The builds
> >> > > > > >>>> for the intermediate commits will be cancelled. Obviously
> >> > > > > >>>> there is a tradeoff: we don't find out if one of the
> >> > > > > >>>> earlier commits breaks the build. That's the cost we need
> >> > > > > >>>> to pay. Nevertheless, our build is so flaky that it's
> >> > > > > >>>> hard to determine whether a failed build result is caused
> >> > > > > >>>> by a flaky test or is an actual failure. Because of this,
> >> > > > > >>>> we don't really lose anything by cancelling builds; it's
> >> > > > > >>>> more important to save build resources. In the
> >> > > > > >>>> maintenance branches for 2.10 and older, the average
> >> > > > > >>>> total build time consumed is around 20 hours, which is a
> >> > > > > >>>> lot.
> >> > > > > >>>>
> >> > > > > >>>> At this time, the overhead of maintenance branch builds
> >> > > > > >>>> doesn't seem to be the source of the problems. There must
> >> > > > > >>>> be some other issue, possibly related to exceeding a
> >> > > > > >>>> usage quota. Hopefully we get the CI slowness issue
> >> > > > > >>>> solved asap.
> >> > > > > >>>>
> >> > > > > >>>> BR,
> >> > > > > >>>>
> >> > > > > >>>> Lari
> >> > > > > >>>>
> >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> >> > > > > >>>>> Hi,
> >> > > > > >>>>>
> >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> >> > > > > >>>>> queue in the last few days. I posted on bui...@apache.org
> >> > > > > >>>>> ( https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s )
> >> > > > > >>>>> and created the INFRA ticket
> >> > > > > >>>>> https://issues.apache.org/jira/browse/INFRA-23633 about
> >> > > > > >>>>> this issue. There's also a thread on the-asf Slack,
> >> > > > > >>>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> >> > > > > >>>>>
> >> > > > > >>>>> It seems that our build queue is finally getting picked
> >> > > > > >>>>> up, but it would be good to see whether we hit a quota
> >> > > > > >>>>> and whether that is the cause of the pauses.
> >> > > > > >>>>>
> >> > > > > >>>>> Another issue is that the master branch broke after
> >> > > > > >>>>> merging 2 conflicting PRs. The fix is in
> >> > > > > >>>>> https://github.com/apache/pulsar/pull/17300 .
> >> > > > > >>>>>
> >> > > > > >>>>> Merging PRs will be slow until we have these 2 problems
> >> > > > > >>>>> solved and existing PRs rebased over the changes. Let's
> >> > > > > >>>>> prioritize merging #17300 before pushing more changes.
> >> > > > > >>>>>
> >> > > > > >>>>> I'd like to point out that a good way to get build
> >> > > > > >>>>> feedback before sending a PR is to run builds on your
> >> > > > > >>>>> personal GitHub Actions CI. The benefit is that it
> >> > > > > >>>>> doesn't consume the shared quota, and builds usually
> >> > > > > >>>>> start instantly. There are instructions in the
> >> > > > > >>>>> contributors' guide:
> >> > > > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to
> >> > > > > >>>>> run builds on your personal GitHub Actions CI.
> >> > > > > >>>>>
> >> > > > > >>>>> BR,
> >> > > > > >>>>>
> >> > > > > >>>>> Lari