hi David

I have some questions about the latest KIP.

Q0:

If both git and Gradle Develocity can serve as our database for querying 
flaky tests, maybe we don’t need the flaky tag? As you described before, that 
would save the overhead of adding/removing the tags. Does that make sense?
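(For context, my understanding is that the tag is just a JUnit 5 tag on the 
test, along the lines of the sketch below. The tag name "flaky" and the test 
names are only my assumptions, and I wrote it in Kotlin purely for 
illustration. The overhead I mean is that someone has to add the tag when a 
test turns flaky and remove it again once the test is stable.)

    // Hypothetical sketch of the per-test quarantine marker (JUnit 5).
    import org.junit.jupiter.api.Tag
    import org.junit.jupiter.api.Test

    class SomeIntegrationTest {
        @Tag("flaky")   // assumed tag name; added when the test turns flaky,
        @Test           // removed once it is considered stable again
        fun testSomethingOccasionallyFlaky() {
            // ...
        }
    }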

Q1:

The main test suite should include the following tests:

1. all unit tests (according to José's comment)
2. the integration tests that have had no flaky failures on trunk for 7 days, 
with a minimum of 10 builds
3. the new integration tests added by “this” PR

Also, since those tests are assumed to be stable, none of them need retries, 
right?
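For my own understanding, the split could be expressed roughly like the 
Gradle sketch below (the Kotlin DSL and the tag name are only my assumptions 
for illustration; Kafka's build scripts are Groovy):

    // Hypothetical sketch: the main suite excludes quarantined tests and is
    // expected to pass without any retries.
    tasks.test {
        useJUnitPlatform {
            excludeTags("flaky")   // assumed quarantine tag name
        }
        // no test-retry configuration here, since these tests are assumed stable
    }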

Q2:

The new integration tests will be excluded from the main test suite by the 
“next” commit, and they will be added back to the main test suite only once 
they pass the quarantine test suite, right? If so, will the “quarantine test 
suite” be executed daily (or every 12h?) to collect enough test results for 
those new integration tests?
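In other words, something roughly like the sketch below, invoked by a 
scheduled CI job rather than per PR (again, the task/tag names and the Kotlin 
DSL are only my assumptions):

    // Hypothetical sketch: a separate task that runs only the quarantined
    // tests, so a nightly (or 12h) job can collect enough history in
    // Develocity to decide when a test may rejoin the main suite.
    tasks.register<Test>("quarantinedTest") {
        testClassesDirs = sourceSets["test"].output.classesDirs
        classpath = sourceSets["test"].runtimeClasspath
        useJUnitPlatform {
            includeTags("flaky")   // assumed quarantine tag name
        }
        ignoreFailures = true      // collecting data should not block the build
    }

The scheduled job would then simply invoke ./gradlew quarantinedTest.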

Q3:

IIRC, GitHub CI runs the tests only for the modules changed by the PR. That 
is done to reduce GitHub runner usage. However, it is hard to ensure that a 
“change” to the client module won’t impact the tests of the core module. 
Hence, should we run the tests of all modules if some important modules are 
touched by the PR? Note that the flaky integration tests would still be 
quarantined.

Best,
Chia-Ping

On 2024/09/18 16:02:31 David Arthur wrote:
> Hello, Kafka community!
> 
> Looking at the last 7 days of GitHub, we have 59 out of 64 trunk builds
> having flaky tests. Excluding timeouts (a separate issue), only 4 builds
> out of the last 7 days have failed due to excess test failures. This is
> actually a slight improvement when compared with the last 28 days. But
> still, this is obviously a bad situation to be in.
> 
> We have previously discussed a few ideas to mitigate the impact that flaky
> tests have on our builds. For PRs, we are actually seeing a lot of
> successful status checks due to our use of the Develocity test retry
> feature. However, the blanket use of "testRetry" is a bad practice in
> my opinion. It makes it far too easy for us to ignore tests that are only
> occasionally flaky. It also applies to unit tests which should never be
> flaky.
> 
> Another problem is that we are naturally introducing flaky tests as new
> features (and tests) are introduced. Similar to feature development, it
> takes some time for tests to mature and stabilize -- tests are code, after
> all.
> 
> I have written down a proposal for tracking and managing our flaky tests. I
> have written this as a KIP even though this is an internal change. I did so
> because I would like us to discuss, debate, and solidify a plan -- and
> ultimately vote on it. A KIP seemed like a good fit.
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> 
> I have back-tested this strategy (as best as I can) to our trunk builds
> from the last month using data from Develocity (i.e., ge.apache.org). I
> looked at two scenarios. The first scenario was simply quarantining tests
> with higher than 1% flaky failures, no test re-runs were considered. The
> second scenario extends the first by allowing up to 3 total flaky failures
> from non-quarantined tests (tests with less than 1% total flakiness).
> 
> Total builds: *238*
> Flaky/Failed builds: *228*
> Flaky builds scenario 1 (quarantine only): *40*
> Flaky builds scenario 2 (quarantine + retry): *3*
> 
> In other words, we can tackle the worst flaky failures with the quarantine
> strategy as described in the KIP and handle the long tail of flaky failures
> with the Develocity retry plugin. If we only had 3 failing trunk builds per
> month to investigate, I'd say we were in pretty good shape :)
> 
> Let me know what you think!
> 
> Cheers,
> David A
> 
