[DISCUSS] KIP-1090 Flaky Tests 👻

David Arthur Wed, 18 Sep 2024 09:03:03 -0700

Hello, Kafka community!

Looking at the last 7 days of builds on GitHub, 59 out of 64 trunk builds
had flaky tests. Excluding timeouts (a separate issue), only 4 builds in
that period failed due to excess test failures. This is actually a slight
improvement compared with the last 28 days, but it is still obviously a bad
situation to be in.


We have previously discussed a few ideas to mitigate the impact that flaky
tests have on our builds. For PRs, we are actually seeing a lot of
successful status checks due to our use of the Develocity test retry
feature. However, the blanket use of "testRetry" is a bad practice in
my opinion. It makes it far too easy for us to ignore tests that are only
occasionally flaky. It also applies to unit tests, which should never be
flaky.

Another problem is that we are naturally introducing flaky tests as new
features (and tests) are introduced. Similar to feature development, it
takes some time for tests to mature and stabilize -- tests are code, after
all.

I have written down a proposal for tracking and managing our flaky tests. I
wrote it up as a KIP, even though this is an internal change, because I
would like us to discuss, debate, and solidify a plan -- and ultimately
vote on it. A KIP seemed like a good fit.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management

I have back-tested this strategy (as best as I can) to our trunk builds
from the last month using data from Develocity (i.e., ge.apache.org). I
looked at two scenarios. The first scenario simply quarantined tests with a
flaky failure rate higher than 1%; no test re-runs were considered. The
second scenario extends the first by allowing up to 3 total flaky failures
per build from non-quarantined tests (tests with less than 1% total
flakiness).

Total builds: *238*
Flaky/Failed builds: *228*
Flaky builds scenario 1 (quarantine only): *40*
Flaky builds scenario 2 (quarantine + retry): *3*

In other words, we can tackle the worst flaky failures with the quarantine
strategy as described in the KIP and handle the long tail of flaky failures
with the Develocity retry plugin. If we only had 3 failing trunk builds per
month to investigate, I'd say we were in pretty good shape :)
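
For anyone who wants to reproduce this kind of back-test, here is a minimal
sketch of the two scenarios in Python. It is not the script used for the
numbers above. The input shape (one list of flaky test names per build), the
function names, and the definition of a test's flaky rate as the fraction of
builds in which it was flaky are all assumptions made for illustration.

```python
from collections import Counter

# Assumed thresholds, taken from the scenario descriptions above.
QUARANTINE_THRESHOLD = 0.01  # quarantine tests flaky in more than 1% of builds
MAX_TOLERATED_FLAKES = 3     # scenario 2: tolerate up to 3 flaky failures per build


def flaky_rates(builds):
    """Return, for each test, the fraction of builds in which it was flaky.

    `builds` is a list with one entry per build; each entry is a list of the
    test names that were flaky in that build (hypothetical input format).
    """
    counts = Counter(test for flaky in builds for test in set(flaky))
    return {test: n / len(builds) for test, n in counts.items()}


def backtest(builds):
    """Count builds that would still be flagged under each scenario."""
    rates = flaky_rates(builds)
    quarantined = {t for t, r in rates.items() if r > QUARANTINE_THRESHOLD}

    baseline = scenario1 = scenario2 = 0
    for flaky in builds:
        if flaky:
            baseline += 1  # any flaky test at all
        remaining = {t for t in flaky if t not in quarantined}
        if remaining:
            scenario1 += 1  # quarantine only: any remaining flake counts
        if len(remaining) > MAX_TOLERATED_FLAKES:
            scenario2 += 1  # quarantine + tolerance for a few flakes
    return {
        "total": len(builds),
        "baseline": baseline,
        "scenario1": scenario1,
        "scenario2": scenario2,
    }
```

Calling backtest() with the per-build flaky test lists exported from
Develocity would give counts comparable to the four figures above, subject to
how the flaky rate is actually defined in the KIP.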

Let me know what you think!

Cheers,
David A
