Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Lari Hotari Fri, 08 Apr 2022 00:01:38 -0700

With the new GitHub Actions CI workflow there are cases where you see a red 
failure mark, but there's no need to rerun failed jobs, since the red mark is 
the result of a failed test report (usually caused by flaky tests) rather than 
a failed job.

The new Pulsar CI workflow renders JUnit XML test reports and integrates them 
into the GitHub UI. This has multiple benefits. For example, test failures are 
shown directly in the PR review.

You will see red failure marks without a failed job when flaky tests fail, but 
later pass in a retry. The failed test result will get recorded to a test 
report, but there's no need to rerun failed jobs. 
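
To illustrate the difference between the job result and the test report, here 
is a minimal Python sketch. It is not the actual Pulsar CI code. The function 
names and test names are hypothetical, and it assumes that a test counts as 
passed for the job if any attempt succeeds, while the report flags every 
failed attempt.

```python
from typing import Dict, List


def job_passed(attempts: Dict[str, List[bool]]) -> bool:
    """The job passes when every test succeeded in at least one attempt."""
    return all(any(results) for results in attempts.values())


def report_check_failed(attempts: Dict[str, List[bool]]) -> bool:
    """The test report shows a red mark when any single attempt failed."""
    return any(not ok for results in attempts.values() for ok in results)


# Hypothetical test names; True = attempt passed, False = attempt failed.
attempts = {
    "testStable": [True],
    "testFlaky": [False, True],  # failed once, then passed in a retry
}
```

With this input the job passes and the report still shows a failure, which is 
the case where nothing needs to be rerun.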

This doesn't block merging, but the failures show up so that they can be 
inspected. This can be confusing at first, since everyone is used to 
rerunning jobs when there's a red failure mark shown in the PR.

It might appear that "/pulsarbot rerun-failure-checks" is broken. That's not 
the case. Usually the issue is that there's no failed job, or that the workflow 
containing the failed job is still executing. A failed job can only be rerun 
after the entire workflow has completed. That's explained in an earlier 
message in this thread.
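
The rule can be sketched as a small Python function. This is only an 
illustration and not the pulsarbot script. The function name 
`rerun_candidates` is hypothetical, and the `status` and `conclusion` fields 
are assumed to look like the ones GitHub reports for workflow runs and jobs.

```python
from typing import Dict, List


def rerun_candidates(run: Dict, jobs: List[Dict]) -> List[str]:
    """Return the names of failed jobs that could be rerun right now.

    Nothing can be rerun while the workflow run is still executing,
    and there is nothing to rerun when no job has failed.
    """
    if run.get("status") != "completed":
        return []
    return [job["name"] for job in jobs if job.get("conclusion") == "failure"]


# Hypothetical job names for illustration only.
jobs = [
    {"name": "unit-tests", "conclusion": "failure"},
    {"name": "integration-tests", "conclusion": "success"},
]
still_running = rerun_candidates({"status": "in_progress"}, jobs)
finished = rerun_candidates({"status": "completed"}, jobs)
```

Here `still_running` is empty because the workflow has not finished, while 
`finished` contains only the failed job.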

With test reports, there's an additional source of confusion: GitHub Actions 
has a bug that causes test reports to get attached to a random workflow when 
multiple workflows are executing. It's a known issue and will be resolved 
once GitHub fixes the bug.
(here's a link to one of the reports about the GitHub Actions bug: 
https://github.community/t/github-actions-status-checks-created-on-incorrect-check-suite-id/16685)

Please let me know if you have trouble with the new Pulsar CI GitHub Actions 
workflow and let's try to resolve the issues together.

I'll try to find a place to document the details that are mentioned in this 
email thread.

-Lari

On 2022/04/01 14:34:02 Lari Hotari wrote:
> I now realized that my advice to close & reopen PRs to pick up master branch 
> changes is problematic. This will cause issues with "/pulsarbot 
> rerun-failure-checks". The script currently looks for the build to restart 
> with the PR's head commit sha. If closing and reopening is used to start new 
> PR build jobs, all build jobs will have the same head commit sha attached to 
> them. When checking for failed builds, the script will also find old 
> builds with the same head commit sha and restart them as well.
> 
> Please rebase your PR (or merge master branch changes into it) to pick up 
> changes from master. Don't close & reopen PRs as I had advised earlier since 
> it causes problems. The wrong builds will be run and that adds to the 
> build queue.
> 
> -Lari
> 
> 
> 
> On 2022/04/01 08:38:54 Lari Hotari wrote:
> > Hi all,
> > 
> > There's a small limitation in re-running failed jobs (builds that fail 
> > because of flaky tests) in the refactored Pulsar CI workflow which combines 
> > multiple jobs into a single workflow.
> > 
> > The limitation is that you need to wait for all jobs to complete before 
> > failed jobs can be re-run.
> > Yesterday there was some issue with GitHub Actions and the build queue was 
> > several hours long. When there's enough build capacity and no build queue, 
> > the new workflow finishes in about 1 hour 20 minutes.
> > 
> > Re-running failed jobs can be requested by commenting "/pulsarbot 
> > rerun-failure-checks" on the PR. This won't do anything if one of the jobs 
> > in the workflow is still executing.
> > 
> > Another source of confusion has been the new test reporting, which shows 
> > all test results and test failures as checks and annotations in the GitHub UI. 
> > 
> > Here's an example:
> > https://github.com/apache/pulsar/pull/14805/checks?check_run_id=5777139002
> > 
> > There's a limitation in GitHub Actions that the test reports get attached 
> > to the first workflow when a PR triggers more than one workflow. We still 
> > have multiple workflows and the test reports get attached to the "CI - CPP, 
> > Python Tests" workflow. Failed tests will show up as red check marks and in 
> > the case of retries, the test might have succeeded in a later attempt, but 
> > the check shows as failed. This won't prevent merging the PR. Please keep 
> > this small detail in mind when interpreting the build results.
> > 
> > The test reports are very verbose at the moment. This is a problem when 
> > checking the PR build results in the GitHub Mobile app. I have created a PR 
> > to reduce the test reporting shown in the GitHub Actions UI: 
> > https://github.com/apache/pulsar/pull/14959
> > 
> > Please let me know if there are any other questions or problems that have 
> > come up with the new refactored Pulsar CI GitHub Actions workflow.
> > 
> > -Lari
> > 
>
