Dear community,

The plan has been executed. A summary of our actions:

1. We cancelled all pending jobs (queued and in-progress).
2. We removed the required checks so that we could merge improvements to the CI workflow (see the .asf.yaml sketch below).
3. We merged a couple of improvements:
   1. Worked around the possible bug triggered by job retries: the broker flaky tests now run in a dedicated workflow.
   2. Moved known flaky tests to the flaky suite.
   3. Optimized runner consumption for docs-only and cpp-only pull requests.
4. We reactivated the required checks.
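For context on steps 2 and 4: ASF projects manage GitHub branch protection through the .asf.yaml file in the repository, so the required checks were dropped and restored by editing that file. Below is a minimal sketch of the relevant section; the check name matches the new summary job mentioned later in this mail, but the exact contexts list used in apache/pulsar is an assumption here, so treat this as illustrative rather than the literal change:

github:
  protected_branches:
    master:
      required_status_checks:
        # Removing this section (step 2) lifted the merge gate so the CI
        # fixes could land; restoring it (step 4) re-enabled the gate.
        # The contexts list below is illustrative, not the exact one.
        contexts:
          - "Pulsar CI / Pulsar CI checks completed"

Note that a .asf.yaml change only takes effect once it is merged to the branch itself, which is why the action plan quoted further down merges the .asf.yaml edit before disabling the workflows.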
Now it's possible to return to normal life:

1. Rebase your branch onto the latest master (there's a button for this in the UI); alternatively, you can close and reopen the pull request to trigger the checks.
2. You can merge pull requests again.
3. You will find a new job in the Checks section called "Pulsar CI / Pulsar CI checks completed" that indicates Pulsar CI passed successfully.

There's a slight chance that the CI will get stuck again in the next few days, but we will keep monitoring it.

Thanks Lari for the nice work!

Regards,
Nicolò Boschi

On Thu, Sep 8, 2022 at 10:55 Lari Hotari <lhot...@apache.org> wrote:

> Thank you Nicolò.
> There's lazy consensus, let's go forward with the action plan.
>
> -Lari
>
> On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > This is the pull for step 2: https://github.com/apache/pulsar/pull/17539
> >
> > This is the script I'm going to use to cancel pending workflows:
> > https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> >
> > I'm going to run the script in a few minutes.
> >
> > I announced on Slack what is happening:
> > https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> >
> > > we're going to execute the plan described in the ML. So any queued actions will be cancelled. In order to validate your pull, it is suggested to run the actions in your own Pulsar fork. Please don't re-run failed jobs or push any other commits, to avoid triggering new actions.
> >
> > Nicolò Boschi
> >
> > On Thu, Sep 8, 2022 at 09:42 Nicolò Boschi <boschi1...@gmail.com> wrote:
> >
> > > Thanks Lari for the detailed explanation. This is kind of an emergency situation, and I believe your plan is the way to go now.
> > >
> > > I already prepared a pull for moving the flaky suite out of the Pulsar CI workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > I can take care of the execution of the plan.
> > >
> > > > 1. Cancel all existing builds in_progress or queued
> > >
> > > I have a script locally that uses GHA to check and cancel the pending runs. We can extend it to all the queued builds (will share it soon).
> > >
> > > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > > > 3. Wait for build to run for .asf.yaml change, merge it
> > >
> > > After the pull is out, we'll need to cancel any other workflows that contributors may inadvertently have triggered.
> > >
> > > > 4. Disable all workflows
> > > > 5. Process specific PRs manually to improve the situation.
> > > >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> > > >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > >
> > > In this step we will merge https://github.com/nicoloboschi/pulsar/pull/8
> > >
> > > I want to add to the list this improvement, which reduces runner usage for doc-only or cpp-only changes: https://github.com/nicoloboschi/pulsar/pull/7
> > >
> > > > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > >
> > > It's better to leave this task to the author of the pull, in order not to create too much load at the same time.
> > >
> > > > 7. Enable workflows
> > > > 8. Start processing PRs with checks to see if things are handled in a better way.
> > > > 9. When things are stable, enable required checks again in .asf.yaml; in the meantime, be careful about merging PRs
> > > > 10. Fix quarantined flaky tests
> > >
> > > Nicolò Boschi
> > >
> > > On Thu, Sep 8, 2022 at 09:27 Lari Hotari <lhot...@apache.org> wrote:
> > >
> > > > If my assumption of a GitHub usage-metrics bug in the GitHub Actions build job queue fairness algorithm is correct, what would help is running the flaky unit test group outside of the Pulsar CI workflow. In that case, the impact of the usage metrics would be limited.
> > > >
> > > > The example at https://github.com/apache/pulsar/actions/runs/3003787409/usage shows this flaw, as explained in the previous email. The total reported execution time in that report is 1d 1h 40m 21s, while the actual usage is about 1/3 of that.
> > > >
> > > > When we move the most commonly failing job out of the Pulsar CI workflow, the impact of the possible usage-metrics bug will be much smaller. I hope GitHub support responds to my issue and queries about this bug. It might take up to 7 days to get a reply, and longer for technical questions. In the meantime we need a solution for getting over this CI slowness issue.
> > > >
> > > > -Lari
> > > >
> > > > On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > > My current assumption about the CI slowness problem is that the usage metrics for Apache Pulsar builds are computed incorrectly on the GitHub side, and that this results in apache/pulsar builds getting throttled. This assumption might be wrong, but it's the best guess at the moment.
> > > > >
> > > > > The fact that supports this assumption is that when re-running failed jobs in a workflow, the execution times for previously successful jobs get counted as if they had all run.
> > > > > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > The reported total usage is about 3x the actual usage.
> > > > >
> > > > > My assumption is that the "fairness algorithm" GitHub uses to provide all Apache projects with about the same amount of GitHub Actions resources takes this flawed usage as the basis of its decisions, and it decides to throttle apache/pulsar builds.
> > > > >
> > > > > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail, and we have been re-running a lot of builds.
> > > > >
> > > > > The other fact supporting the theory of flawed usage metrics in the fairness algorithm is that other Apache projects aren't reporting issues with GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments on INFRA-23633 [1]:
> > > > > > Unlike the case 2 years ago, the problem is not affecting all projects. In Apache Airflow we do not see any particular slow-down with Public Runners at this moment (just checked - everything is "as usual"). So I'd say it is something specific to Pulsar, not to "ASF" as a whole.
> > > > >
> > > > > There are also other comments from Jarek about the GitHub "fairness algorithm" (comment [2], other comment [3]):
> > > > > > But I believe the current problem is different - it might be (looking at your jobs) simply a bug in GA that you hit, or indeed your demands are simply too high.
> > > > >
> > > > > I have opened 2 tickets at support.github.com (2 days ago and yesterday), and there hasn't been any response yet. It might take up to 7 days to get a response. We cannot rely on GitHub Support resolving this issue.
> > > > >
> > > > > I propose that we go ahead with the previously suggested action plan:
> > > > > > One possible way forward:
> > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > 4. Disable all workflows
> > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> > > > > >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > > > > > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > > > > > 7. Enable workflows
> > > > > > 8. Start processing PRs with checks to see if things are handled in a better way.
> > > > > > 9. When things are stable, enable required checks again in .asf.yaml; in the meantime, be careful about merging PRs
> > > > > > 10. Fix quarantined flaky tests
> > > > >
> > > > > To clarify: steps 1-6 would optimally be done in 1 day, and we would stop processing ordinary PRs during this time. We would only handle PRs that fix the CI situation during this exceptional period.
> > > > >
> > > > > -Lari
> > > > >
> > > > > Links to Jarek's comments:
> > > > > [1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > > [2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > [3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > >
> > > > > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > > > One possible way forward:
> > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > 4. Disable all workflows
> > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> > > > > >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > > > > > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > > > > > 7. Enable workflows
> > > > > > 8. Start processing PRs with checks to see if things are handled in a better way.
> > > > > > 9. When things are stable, enable required checks again in .asf.yaml; in the meantime, be careful about merging PRs
> > > > > > 10. Fix quarantined flaky tests
> > > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > > > > The problem with CI is becoming worse. The build queue is now 235 jobs, and the queue time is over 7 hours.
> > > > > > >
> > > > > > > We will need to start shedding load in the build queue and get some fixes in.
> > > > > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some activities. I have created 2 GitHub Support tickets, but it usually takes up to a week to get a response.
> > > > > > >
> > > > > > > I have some assumptions about the issue, but they are just assumptions.
> > > > > > > One oddity is that when re-running failed jobs in a large workflow, the execution times for previously successful jobs get counted as if they had run.
> > > > > > > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > > The reported usage is about 3x the actual usage.
> > > > > > > My assumption is that the "fairness algorithm" GitHub uses to provide all Apache projects with about the same amount of GitHub Actions resources takes this flawed usage as the basis of its decisions.
> > > > > > > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail, and we are re-running a lot of builds.
> > > > > > >
> > > > > > > Another problem is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before, when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with a manual admin operation behind the scenes.
> > > > > > >
> > > > > > > I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process, so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.
> > > > > > >
> > > > > > > I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
> > > > > > >
> > > > > > > -Lari
> > > > > > >
> > > > > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > > > > We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > > > > >
> > > > > > > > Jarek has done a large amount of GitHub Actions work with Apache Airflow, and his suggestions might be helpful. One of his suggestions was Apache Yetus.
> > > > > > > > I think he means using the Maven plugins:
> > > > > > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > > > > >
> > > > > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > > > > >
> > > > > > > > > -Lari
> > > > > > > > >
> > > > > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > > > > > > I asked Gavin McDonald for an update on the Apache org GitHub Actions usage stats on the-asf slack in this thread:
> > > > > > > > > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> > > > > > > > > >
> > > > > > > > > > I hope we get this issue resolved, since it delays PR processing a lot.
> > > > > > > > > >
> > > > > > > > > > -Lari
> > > > > > > > > >
> > > > > > > > > > On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > > > > > > > Pulsar CI continues to be congested, and the build queue [1] is very long at the moment: there are 147 build jobs queued and 16 jobs in progress.
> > > > > > > > > > >
> > > > > > > > > > > I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous emails in this thread.
> > > > > > > > > > >
> > > > > > > > > > > -Lari
> > > > > > > > > > >
> > > > > > > > > > > [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > > > > > > >
> > > > > > > > > > > On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > > > > > > > > Pulsar CI continues to be congested, and the build queue is long.
> > > > > > > > > > > >
> > > > > > > > > > > > I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous email in this thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Some updates:
> > > > > > > > > > > >
> > > > > > > > > > > > There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. The Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in the GitHub UI, but it produced invalid results.
> > > > > > > > > > > >
> > > > > > > > > > > > I made a change to mitigate a source of additional GitHub Actions overhead. In the past, each cherry-picked commit to a maintenance branch of Pulsar triggered a lot of workflow runs.
> > > > > > > > > > > >
> > > > > > > > > > > > The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > > > > > > > > > > >
> > > > > > > > > > > > concurrency:
> > > > > > > > > > > >   group: ${{ github.workflow }}-${{ github.ref }}
> > > > > > > > > > > >   cancel-in-progress: true
> > > > > > > > > > > >
> > > > > > > > > > > > I added this to all maintenance branch GitHub Actions workflows:
> > > > > > > > > > > >
> > > > > > > > > > > > branch-2.10 change:
> > > > > > > > > > > > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > > > > > > > > branch-2.9 change:
> > > > > > > > > > > > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > > > > > > > > branch-2.8 change:
> > > > > > > > > > > > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > > > > > > > > branch-2.7 change:
> > > > > > > > > > > > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > > > > > > > >
> > > > > > > > > > > > branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > > > > > > > > > > >
> > > > > > > > > > > > The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will eventually run; the builds for the intermediate commits get cancelled. Obviously there's a tradeoff: we don't find out if one of the earlier commits breaks the build. That's a cost we need to pay. Nevertheless, our build is so flaky that it's hard to determine whether a failed build result is caused by a flaky test or by an actual failure, so we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours, which is a lot.
> > > > > > > > > > > >
> > > > > > > > > > > > At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue, possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > > > > > > > > > > >
> > > > > > > > > > > > BR,
> > > > > > > > > > > > Lari
> > > > > > > > > > > >
> > > > > > > > > > > > On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > GitHub Actions builds have been piling up in the build queue over the last few days.
> > > > > > > > > > > > > I posted to bui...@apache.org (https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s) and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > > > > > > > > > > > There's also a thread on the-asf slack: https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > > > > > > > > >
> > > > > > > > > > > > > It seems that our build queue is finally getting picked up, but it would be great to see whether we hit a quota and whether that is the cause of the pauses.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another issue is that the master branch broke after merging 2 conflicting PRs.
> > > > > > > > > > > > > The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > > > > > > > > > >
> > > > > > > > > > > > > Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to point out that a good way to get build feedback before sending a PR is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota, and builds usually start instantly.
> > > > > > > > > > > > > There are instructions in the contributors guide about this:
> > > > > > > > > > > > > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > > > > > > > > > You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > > > > > > > > > > >
> > > > > > > > > > > > > BR,
> > > > > > > > > > > > > Lari