Dear Pulsar community,

Here's a report of the flaky tests in Pulsar CI during the observation
period of 2023-09-24 to 2023-10-01.

There seems to be quite a lot of flakiness, which results in a need to
restart test runs quite a few times before a PR can be processed. Yesterday
I made a few simple PRs, and in one of them there were flaky test failures 6
times in a row. That is pretty frustrating and a waste of time for everyone
working on Pulsar PRs.

The flaky tests are detected by evaluating the CI builds of all PRs that
were merged during the observation period. The logs are checked only for
builds where the SHA of the head of the PR matches the SHA which got
merged. This ensures that every failure found is a real flake: no changes
were made to the PR between the failing run and the successful run that
allowed it to be merged. The logs are processed using a script that
I implemented a long time ago for detecting Pulsar flaky tests.
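
As a rough illustration of the SHA-matching idea described above, here is
a minimal Python sketch. It is not the actual detection script (that one
is linked further down); the Build record, the count_flaky_failures
function and the sample data are all hypothetical names made up for this
example, and it assumes the failed test names have already been extracted
from the build logs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """One CI run of a PR (hypothetical record, not the real script's model)."""
    pr_number: int
    head_sha: str        # commit the CI run was executed against
    failed_tests: tuple  # names of the tests that failed in this run


def count_flaky_failures(builds, merged_sha_by_pr):
    """Count failures only in runs whose head SHA equals the merged SHA.

    A failure in such a run is treated as a flake, because the same commit
    later passed CI and was merged without further changes.
    """
    counts = Counter()
    for build in builds:
        # PRs that were not merged are absent from the map and are skipped.
        if merged_sha_by_pr.get(build.pr_number) == build.head_sha:
            counts.update(build.failed_tests)
    return counts


# Made-up sample data: PR 1 was merged at commit "bbb".
builds = [
    Build(1, "aaa", ("ExampleTest.testSomething",)),  # older commit: ignored
    Build(1, "bbb", ("ExampleTest.testSomething",)),  # merged commit: a flake
    Build(1, "bbb", ()),                              # the passing re-run
    Build(2, "ccc", ("OtherTest.setUp",)),            # PR never merged: ignored
]
flakes = count_flaky_failures(builds, {1: "bbb"})
```

The failure on commit "aaa" is deliberately not counted, since the PR was
changed afterwards and the failure might have been a real bug in the PR.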

Kudos to Nicolò for not only integrating the flaky test detection script
with GH actions, but also for enhancing its capabilities in terms of
investigation and reporting. There was a minor hiccup: changes introduced
by a Maven surefire plugin update affected the script, but I've since
resolved the issue and executed the report successfully.

Here are the 3 most flaky tests:

https://github.com/apache/pulsar/issues/21284
CompactionTest.testDispatcherMaxReadSizeBytes /
StrategicCompactionTest.testDispatcherMaxReadSizeBytes
total 9 failures

https://github.com/apache/pulsar/issues/21286
AuditorBookieTest.setUp
AuditorPeriodicCheckTest.setUp
BookieAutoRecoveryTest.setUp
TestAutoRecoveryAlongWithBookieServers.setUp
TestReplicationWorker.setUp
total 8 failures

https://github.com/apache/pulsar/issues/21289
AuditorLedgerCheckerTest.setUp
AutoRecoveryMainTest.setUp
total 3 failures

Fixing these 3 issues would reduce the CI flakiness significantly. 2 of the
issues above are related to starting a ZooKeeper server during the test
runs. It's possible that the root causes of the issues get hidden since
failed tests get retried. For detailed investigation, it's usually necessary
to download the surefire reports that the CI job uploads after a failure.

More details in this Google sheet:
https://docs.google.com/spreadsheets/d/1gtu-XrLumjBFPk9kDKcJOQfxsvIE2EiuZO7IB7ab6q0/edit#gid=1446387426

Detailed reports and flaky test reporting source:
https://github.com/lhotari/pulsar-flakes/tree/master/2023-09-24-to-2023-10-01

To coordinate the work,

1) Please search for an existing issue, or browse all flaky test issues,
using "flaky" or the test class name (without package) as the search term:
https://github.com/apache/pulsar/issues?q=is%3Aopen+flaky+sort%3Aupdated-desc

2) If there isn't an issue for a particular flaky test failure that you'd
like to fix, please create an issue using the "Flaky test" template at
https://github.com/apache/pulsar/issues/new/choose

3) Please comment on the issue that you are working on it.

Let's reduce the flakiness to make contributing to Pulsar a better
experience!

-Lari
