Q0/Q1: If we automatically quarantine integration tests that have shown recent flakiness, I worry we will simply ignore them. However, the alternative is to let them stay in the main suite while flaky, which could start failing builds.

I think the question comes down to: what is the outcome if we don't notice a new flaky test? If we automatically move it to quarantine, the risk is that we continue not noticing it and the bug persists longer. If we let it live in the main suite, we will probably start getting flaky failures on trunk, though there is a chance that the flakiness stays below the main suite's threshold and goes unnoticed. I prefer using the explicit tags because they also make it obvious in the source code which tests are flaky.
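To illustrate what I mean by visibility in the source code, here is a minimal sketch using JUnit 5's @Tag. The tag name "flaky", the class name, and the Kotlin syntax are illustrative assumptions only (Kafka's tests are Java, but the idea is the same), not what the KIP prescribes:

```kotlin
import org.junit.jupiter.api.Tag
import org.junit.jupiter.api.Test

// Hypothetical example: the "flaky" tag name and this test class are
// illustrative only. The point is that a quarantined test is clearly
// marked where it is defined, so it is hard to forget about it.
@Tag("flaky")
class SomeQuarantinedIntegrationTest {

    @Test
    fun reproducesTheFlakyScenario() {
        // ... the scenario that intermittently fails ...
    }
}
```

Removing a test from quarantine would then just mean deleting the tag, which is the add/remove overhead mentioned in Q0.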
Q2: There are two suites: main and quarantine. Both will be run on each trunk commit, but with different Develocity retry settings. Also, a failure of the quarantine suite should not fail the overall build. This is how we will gather the 7 days of data for quarantined tests. We can also add a nightly workflow run to ensure sufficient test data. (A rough sketch of this split is at the end of my reply below.)

Q3: You're correct that the CI for our PRs will only run the changed modules' tests. If desired, we can run the quarantine suite with --rerun-tasks to force those tests to run. Longer term, I would like to have a small set of integration tests that we always run. These would help with the scenario you describe (a "clients" change affecting "core", or vice versa). I think this expands beyond the scope of the KIP, but it's good to note.
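To make the Q2 split more concrete, here is a minimal Gradle Kotlin DSL sketch. It is not the actual build change proposed in the KIP: the task names, the "flaky" tag, and the omitted test classpath wiring and Develocity retry configuration are all assumptions for illustration.

```kotlin
// build.gradle.kts -- a rough sketch only, assuming quarantined tests carry
// a hypothetical JUnit 5 tag named "flaky".

tasks.named<Test>("test") {
    useJUnitPlatform {
        // Main suite: stable tests only. These are expected to pass
        // without retries.
        excludeTags("flaky")
    }
}

tasks.register<Test>("quarantineTest") {
    useJUnitPlatform {
        // Quarantine suite: only the tagged tests. Different Develocity
        // retry settings would be applied here (omitted in this sketch).
        includeTags("flaky")
    }
    // A quarantine failure should not fail the overall build.
    ignoreFailures = true
}
```

Both tasks would run on each trunk commit, and the quarantine suite's results over the trailing 7 days are what feed the decision to promote a test back to the main suite.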
On Sat, Sep 21, 2024 at 9:01 AM Chia-Ping Tsai <chia7...@apache.org> wrote:

> hi David
>
> I have some questions for the latest KIP.
>
> Q0:
>
> If both git and Gradle Develocity can be our database for querying flaky
> tests, maybe we don't need the flaky tag? As you described before, that
> can save the overhead of adding/removing the tags. Makes sense?
>
> Q1:
>
> The main test suite should include the following tests:
>
> 1. all unit tests (according to José's comment)
> 2. the integration tests which have had no flaky failures on trunk for 7
> days, with a minimum of 10 builds
> 3. the new integration tests added by "this" PR
>
> Also, those tests are assumed to be stable, so none of them need retries,
> right?
>
> Q2:
>
> The new integration tests will be excluded from the main test suite by
> the "next" commit, and they will be added to the main test suite only if
> they pass the quarantine test suite, right? If so, will the quarantine
> test suite be executed daily (or every 12h?) to collect enough test
> results for those new integration tests?
>
> Q3:
>
> IIRC, GitHub CI runs the tests only for the modules changed by the PR.
> That is done to reduce GitHub runner usage. However, it is hard to ensure
> a "change" to the clients module won't impact the tests of the core
> module. Hence, should we run all modules' tests if some important modules
> are touched by the PR? Note that the flaky integration tests would still
> be quarantined.
>
> Best,
> Chia-Ping
>
> On 2024/09/18 16:02:31 David Arthur wrote:
> > Hello, Kafka community!
> >
> > Looking at the last 7 days of GitHub, we have 59 out of 64 trunk builds
> > with flaky tests. Excluding timeouts (a separate issue), only 4 builds
> > out of the last 7 days have failed due to excess test failures. This is
> > actually a slight improvement compared with the last 28 days. But
> > still, this is obviously a bad situation to be in.
> >
> > We have previously discussed a few ideas to mitigate the impact that
> > flaky tests have on our builds. For PRs, we are actually seeing a lot
> > of successful status checks due to our use of the Develocity test retry
> > feature. However, the blanket use of "testRetry" is a bad practice in
> > my opinion. It makes it far too easy for us to ignore tests that are
> > only occasionally flaky. It also applies to unit tests, which should
> > never be flaky.
> >
> > Another problem is that we are naturally introducing flaky tests as new
> > features (and tests) are introduced. Similar to feature development, it
> > takes some time for tests to mature and stabilize -- tests are code,
> > after all.
> >
> > I have written down a proposal for tracking and managing our flaky
> > tests. I have written this as a KIP even though this is an internal
> > change. I did so because I would like us to discuss, debate, and
> > solidify a plan -- and ultimately vote on it. A KIP seemed like a good
> > fit.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> >
> > I have back-tested this strategy (as best as I can) against our trunk
> > builds from the last month using data from Develocity (i.e.,
> > ge.apache.org). I looked at two scenarios. The first scenario was
> > simply quarantining tests with higher than 1% flaky failures; no test
> > re-runs were considered. The second scenario extends the first by
> > allowing up to 3 total flaky failures from non-quarantined tests (tests
> > with less than 1% total flakiness).
> >
> > Total builds: *238*
> > Flaky/Failed builds: *228*
> > Flaky builds scenario 1 (quarantine only): *40*
> > Flaky builds scenario 2 (quarantine + retry): *3*
> >
> > In other words, we can tackle the worst flaky failures with the
> > quarantine strategy as described in the KIP and handle the long tail of
> > flaky failures with the Develocity retry plugin. If we only had 3
> > failing trunk builds per month to investigate, I'd say we were in
> > pretty good shape :)
> >
> > Let me know what you think!
> >
> > Cheers,
> > David A

--
David Arthur