Dear Pulsar community,

Here's a report of the flaky tests in Pulsar CI during the observation period of 2023-09-24 to 2023-10-01.
There seems to be quite a lot of flakiness, which results in a need to restart test runs quite a few times before a PR can be processed. Yesterday I made a few simple PRs and in one of them there were flaky test failures 6 times in a row. That is pretty frustrating and a waste of time for everyone working on Pulsar PRs.

The flaky tests are detected by evaluating the CI builds of all PRs that were merged during the observation period. The logs are checked only for builds where the SHA of the head of the PR matches the SHA that got merged. This ensures that every failure found is a real flake: the PR was merged at that same commit, so no changes were made to the PR to make the tests pass later.

The processing of the logs is done using a script that I implemented a long time ago for detecting Pulsar flaky tests. Kudos to Nicolò for not only integrating the flaky test detection script with GH actions, but also for enhancing its capabilities in terms of investigation and reporting. There was a minor hiccup because a Maven surefire plugin update introduced changes that affected the script, but I've since resolved the issue and executed the report successfully.

Here are the 3 most flaky tests:

https://github.com/apache/pulsar/issues/21284
CompactionTest.testDispatcherMaxReadSizeBytes / StrategicCompactionTest.testDispatcherMaxReadSizeBytes
total 9 failures

https://github.com/apache/pulsar/issues/21286
AuditorBookieTest.setUp
AuditorPeriodicCheckTest.setUp
BookieAutoRecoveryTest.setUp
TestAutoRecoveryAlongWithBookieServers.setUp
TestReplicationWorker.setUp
total 8 failures

https://github.com/apache/pulsar/issues/21289
AuditorLedgerCheckerTest.setUp
AutoRecoveryMainTest.setUp
total 3 failures

Fixing these 3 issues would reduce the CI flakiness significantly. 2 of the issues above are related to starting a ZooKeeper server during the test runs. It's possible that the root causes of the issues get hidden since failed tests get retried.
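To illustrate the detection rule described earlier (only failures from builds whose head SHA equals the SHA that got merged are counted), here is a minimal sketch in Python. It is not the actual detection script: the `CiRun` record, the `count_flaky_failures` function, the test names, and the sample data are all hypothetical and only show the filtering idea.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CiRun:
    """One CI build of a PR (hypothetical record, not the real script's format)."""
    pr_number: int
    head_sha: str
    failed_tests: list = field(default_factory=list)


def count_flaky_failures(runs, merged_head_sha_by_pr):
    """Count test failures in builds whose head SHA is the SHA that got merged."""
    counts = Counter()
    for run in runs:
        merged_sha = merged_head_sha_by_pr.get(run.pr_number)
        # Skip PRs that were not merged, and builds of earlier commits:
        # a failure there may have been fixed by a later change to the PR.
        if merged_sha is None or run.head_sha != merged_sha:
            continue
        counts.update(run.failed_tests)
    return counts


# Made-up sample data for illustration only.
runs = [
    CiRun(1, "aaa111", ["ExampleTest.testA"]),  # older commit of PR 1: ignored
    CiRun(1, "bbb222", ["ExampleTest.testB"]),  # merged commit, failed run: a flake
    CiRun(1, "bbb222", []),                     # merged commit, rerun passed
    CiRun(2, "ccc333", ["ExampleTest.testB"]),  # PR 2 was never merged: ignored
]
merged_head_sha_by_pr = {1: "bbb222"}

counts = count_flaky_failures(runs, merged_head_sha_by_pr)
for test_name, failures in counts.most_common(3):
    print(f"{test_name}: {failures} failure(s)")
```

The key point is that a failure at the merged commit must be a flake, because the same code later passed and was merged without further changes.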
For detailed investigation, it's usually necessary to download the surefire reports that the CI job uploads after a failure.

More details in this Google sheet:
https://docs.google.com/spreadsheets/d/1gtu-XrLumjBFPk9kDKcJOQfxsvIE2EiuZO7IB7ab6q0/edit#gid=1446387426

Detailed reports and flaky test reporting source:
https://github.com/lhotari/pulsar-flakes/tree/master/2023-09-24-to-2023-10-01

To coordinate the work,
1) Please search for an existing issue, or search across all flaky issues using "flaky" or the test class name (without package) as the search term:
https://github.com/apache/pulsar/issues?q=is%3Aopen+flaky+sort%3Aupdated-desc
2) If there isn't an issue for a particular flaky test failure that you'd like to fix, please create an issue using the "Flaky test" template at https://github.com/apache/pulsar/issues/new/choose
3) Please comment on the issue that you are working on it.

Let's reduce the flakiness to make contributing to Pulsar a better experience!

-Lari