Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Lari Hotari Thu, 14 Apr 2022 02:25:55 -0700

GitHub Actions currently has a problem, and the UI shows this warning:
"We are having problems searching workflow runs. The results may not be 
complete."
(I can see this warning on https://github.com/apache/pulsar/actions)


The impact is that "/pulsarbot rerun-failure-checks" doesn't work, since it 
cannot find the failed or cancelled workflow runs.
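
To illustrate, here is a minimal sketch of the kind of selection the rerun 
script performs. It is not the actual pulsarbot code: the function name and 
the sample data are hypothetical, and the only assumption is that runs are 
matched by the PR's head commit sha (as described further down in this 
thread) and by the "status" and "conclusion" fields of GitHub workflow run 
objects.

```python
# Hypothetical sketch, not the actual pulsarbot implementation.
# Only completed runs that failed or were cancelled are candidates for a rerun.
RERUNNABLE_CONCLUSIONS = {"failure", "cancelled"}


def select_runs_to_rerun(runs, head_sha):
    """Return the ids of completed runs for head_sha that failed or were cancelled."""
    return [
        run["id"]
        for run in runs
        if run["head_sha"] == head_sha
        and run["status"] == "completed"
        and run["conclusion"] in RERUNNABLE_CONCLUSIONS
    ]


# Made-up sample data shaped like GitHub workflow run objects.
sample_runs = [
    {"id": 1, "head_sha": "abc123", "status": "completed", "conclusion": "failure"},
    {"id": 2, "head_sha": "abc123", "status": "in_progress", "conclusion": None},
    {"id": 3, "head_sha": "abc123", "status": "completed", "conclusion": "success"},
    {"id": 4, "head_sha": "def456", "status": "completed", "conclusion": "cancelled"},
]

# Only run 1 qualifies: run 2 is still executing and run 3 succeeded.
print(select_runs_to_rerun(sample_runs, "abc123"))

# If the search returns nothing (as with the current GitHub problem),
# there is nothing to rerun.
print(select_runs_to_rerun([], "abc123"))
```

With an incomplete search result the list of candidates can be empty even 
though the PR has failed runs, which is why the command appears to do nothing.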

-Lari

On 2022/04/08 07:01:33 Lari Hotari wrote:
> With the new GitHub Actions CI workflow there are cases where you see a red 
> mark as a failure, but there's no need to rerun failed jobs since the red 
> failure marks are a result of failed test reports (usually from failed flaky 
> tests).
> 
> The new Pulsar CI workflow renders JUnit XML test reports and integrates them 
> into the GitHub UI. This has multiple benefits. For example, test failures 
> are shown directly in the PR review. 
> 
> You will see red failure marks without a failed job when flaky tests fail, 
> but later pass in a retry. The failed test result will get recorded to a test 
> report, but there's no need to rerun failed jobs. 
> 
> This doesn't block merging, but will show up so that the failures can be 
> inspected. This can be confusing at first, since everyone is used to 
> rerunning jobs when there's a red failure mark shown in the PR.
> 
> It might appear that "/pulsarbot rerun-failure-checks" is broken. That's not 
> the case. Usually the issue is that there's no failed job or the workflow 
> where a job has failed is still executing. A failed job in a workflow can 
> only be rerun after the entire workflow has completed. That's explained in an 
> earlier message in this thread.
> 
> With test reports, there's an additional source of confusion: GitHub Actions 
> has a bug where test reports get attached to a random workflow when multiple 
> workflows are executing. It's a known issue, and it will be resolved once 
> GitHub fixes the bug.
> (here's a link to one of the reports about the GitHub Actions bug: 
> https://github.community/t/github-actions-status-checks-created-on-incorrect-check-suite-id/16685)
> 
> Please let me know if you have trouble with the new Pulsar CI GitHub Actions 
> workflow and let's try to resolve the issues together.
> 
> I'll try to find a place to document the details that are mentioned in this 
> email thread.
> 
> -Lari
> 
> 
> On 2022/04/01 14:34:02 Lari Hotari wrote:
> > I now realized that my advice to close & reopen PRs to pick up master 
> > branch changes is problematic. This will cause issues with "/pulsarbot 
> > rerun-failure-checks". The script currently looks for the build to restart 
> > with the PR's head commit sha. If closing and reopening is used to start 
> > new PR build jobs, all build jobs will have the same head commit sha 
> > attached to them. When checking for failed builds, the script will also 
> > find old builds with the same head commit sha and restart them too.
> > 
> > Please rebase your PR (or merge master branch changes into it) to pick up 
> > changes from master. Don't close & reopen PRs as I had advised earlier 
> > since it causes problems. The wrong builds will be run and that adds up in 
> > the build queue.
> > 
> > -Lari
> > 
> > 
> > 
> > On 2022/04/01 08:38:54 Lari Hotari wrote:
> > > Hi all,
> > > 
> > > There's a small limitation in re-running failed jobs (builds that fail 
> > > because of flaky tests) in the refactored Pulsar CI workflow which 
> > > combines multiple jobs into a single workflow.
> > > 
> > > The limitation is that you need to wait for all jobs to complete before 
> > > failed jobs can be re-run.
> > > Yesterday there was some issue with GitHub Actions and the build queue 
> > > was several hours long. When there's enough build capacity and no build 
> > > queue, the new workflow finishes in about 1 hour 20 minutes.
> > > 
> > > Re-running failed jobs can be requested by commenting "/pulsarbot 
> > > rerun-failure-checks" on the PR. This won't do anything if one of the 
> > > jobs in the workflow is still executing.
> > > 
> > > Another source of confusion has been the new test reporting, which shows 
> > > all test results and test failures as checks and annotations in the 
> > > GitHub UI. 
> > > 
> > > Here's an example:
> > > https://github.com/apache/pulsar/pull/14805/checks?check_run_id=5777139002
> > > 
> > > There's a limitation in GitHub Actions that the test reports get attached 
> > > to the first workflow when a PR triggers more than one workflow. We still 
> > > have multiple workflows and the test reports get attached to the "CI - 
> > > CPP, Python Tests" workflow. Failed tests will show up as red check marks 
> > > and in the case of retries, the test might have succeeded in a later 
> > > attempt, but the check shows as failed. This won't prevent merging the 
> > > PR. Please keep this small detail in mind when interpreting the build 
> > > results.
> > > 
> > > The test reports are very verbose at the moment. This is a problem when 
> > > checking the PR build results in the GitHub Mobile app. I have created a 
> > > PR to reduce the test reporting in the GitHub Actions UI: 
> > > https://github.com/apache/pulsar/pull/14959
> > > 
> > > Please let me know if there are any other questions or problems that have 
> > > come up with the new refactored Pulsar CI GitHub Actions workflow.
> > > 
> > > -Lari
> > > 
> > 
>
