Hello,

We have had a solution for moving flaky tests to the flaky test group for quite some time. However, this solution did not work for individual test methods. This issue has now been resolved in the CI build. I have also moved the most flaky tests to the flaky test group, because a single CI build often needed several retries before all of them passed. This problem has worsened over the past month.

The PR containing the changes is available at https://github.com/apache/pulsar/pull/22433. All tests that were moved have been reported as flaky and are listed in the description of that PR.

The flaky tests build job will fail when a flaky test fails, but this will not prevent a PR from being merged. If a test is more or less useless, it can either be removed or moved to the quarantine group. Errors in the quarantine group are ignored, but the test report will still include the failures. The quarantine solution has been available for a long time, so it was not added in this PR.

When a test method is moved to the flaky test group, the TestNG annotation should be something like @Test(groups = "flaky"). Within the test class, it is crucial to ensure that all BeforeClass, BeforeMethod, AfterMethod, and AfterClass annotations contain "(alwaysRun = true)", for example, @BeforeClass(alwaysRun = true). Without this change, the before or after methods will not execute when the flaky test method is run, which could lead to NPEs or other odd issues during the execution of the test method in the flaky test group.

Please let me know if you have any concerns about this change. I hope that the flaky tests will eventually be fixed. However, we should be more rigorous about moving flaky tests to the flaky test group in the future. We spend a lot of time and build resources when we allow flaky tests to be part of the regular test runs.

Looking forward to more contributions in this area,

-Lari
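
P.S. As a rough illustration of the annotation change described above, here is a minimal sketch of a test class where a single method has been moved to the flaky group. The class, field, and method names are hypothetical and not taken from the Pulsar code base; only the TestNG annotations and their attributes (groups, alwaysRun) are the point of the example.

```java
import org.testng.Assert;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

// Hypothetical test class; all names here are illustrative only.
public class ExampleServiceTest {
    private StringBuilder resource;

    // alwaysRun = true lets this setup run even when only the "flaky" group is selected.
    @BeforeClass(alwaysRun = true)
    public void setup() {
        resource = new StringBuilder("ready");
    }

    // Cleanup is marked the same way so it is not skipped for group-filtered runs.
    @AfterClass(alwaysRun = true)
    public void cleanup() {
        resource = null;
    }

    // Not in any group: stays in the regular test run.
    @Test
    public void testStableBehavior() {
        Assert.assertEquals(resource.toString(), "ready");
    }

    // Only this method is moved to the flaky test group.
    @Test(groups = "flaky")
    public void testSometimesFails() {
        // If setup() lacked alwaysRun = true, resource could still be null here.
        Assert.assertNotNull(resource);
    }
}
```

The same pattern applies to @BeforeMethod and @AfterMethod if the class uses them.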