Hello, Kafka community!

Looking at the last 7 days of GitHub builds, 59 out of 64 trunk builds had flaky tests. Excluding timeouts (a separate issue), only 4 builds in the last 7 days failed due to excess test failures. This is actually a slight improvement compared with the last 28 days. Still, this is obviously a bad situation to be in.
We have previously discussed a few ideas to mitigate the impact that flaky tests have on our builds. For PRs, we are actually seeing a lot of successful status checks due to our use of the Develocity test retry feature. However, the blanket use of "testRetry" is a bad practice in my opinion. It makes it far too easy for us to ignore tests that are only occasionally flaky. It also applies to unit tests which should never be flaky.

Another problem is that we are naturally introducing flaky tests as new features (and tests) are introduced. Similar to feature development, it takes some time for tests to mature and stabilize -- tests are code, after all.

I have written down a proposal for tracking and managing our flaky tests. I have written this as a KIP even though this is an internal change. I did so because I would like us to discuss, debate, and solidify a plan -- and ultimately vote on it. A KIP seemed like a good fit.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management

I have back-tested this strategy (as best as I can) to our trunk builds from the last month using data from Develocity (i.e., ge.apache.org). I looked at two scenarios. The first scenario was simply quarantining tests with higher than 1% flaky failures, no test re-runs were considered. The second scenario extends the first by allowing up to 3 total flaky failures from non-quarantined tests (tests with less than 1% total flakiness).

Total builds: *238*
Flaky/Failed builds: *228*
Flaky builds scenario 1 (quarantine only): *40*
Flaky builds scenario 2 (quarantine + retry): *3*

In other words, we can tackle the worst flaky failures with the quarantine strategy as described in the KIP and handle the long tail of flaky failures with the Develocity retry plugin. If we only had 3 failing trunk builds per month to investigate, I'd say we were in pretty good shape :)

Let me know what you think!

Cheers,
David A
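
P.S. For anyone who wants to try a similar back-test, here is a minimal sketch of how the two scenarios described above could be computed. It is not the script behind the numbers in this email. The input layout (one set of flaky test names per build), the function name `backtest`, and its parameters are all hypothetical, and it assumes every test runs in every build, so a test's flaky rate is simply the share of builds in which it flaked.

```python
from collections import Counter


def backtest(builds, threshold=0.01, allowed_failures=3):
    """Back-test the quarantine strategy on historical builds.

    builds: list of sets; each set holds the names of tests that were
            flaky in that build (hypothetical layout, not a Develocity export).
    threshold: flaky rate above which a test is quarantined (1% here).
    allowed_failures: flaky failures tolerated per build in scenario 2.
    """
    total = len(builds)

    # Count in how many builds each test was flaky.
    flaky_counts = Counter(test for flaky in builds for test in flaky)

    # Assumption: each test runs once per build, so rate = count / total.
    quarantined = {t for t, n in flaky_counts.items() if n / total > threshold}

    # Flaky failures left in each build once quarantined tests are ignored.
    remaining = [len(flaky - quarantined) for flaky in builds]

    return {
        "total_builds": total,
        "flaky_builds": sum(1 for flaky in builds if flaky),
        # Scenario 1: quarantine only; any remaining flaky test fails the build.
        "scenario_1": sum(1 for n in remaining if n > 0),
        # Scenario 2: quarantine plus tolerance for a few remaining failures.
        "scenario_2": sum(1 for n in remaining if n > allowed_failures),
        "quarantined": quarantined,
    }
```

Calling `backtest(builds)` with a month of per-build flaky sets returns the same four counts listed above, plus the set of tests that would be quarantined. Scenario 2 here models "retry" as a fixed budget of flaky failures per build, which is a simplification of what the retry plugin actually does.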