Chia/TengYao/TaiJuWu, I agree that tags are a straightforward approach. In fact, my initial idea was to use tags as the isolation mechanism.
Let me try to motivate the use of a text file a bit more. Consider the "new tests" scenario where a developer has added a new integration test. If we use annotations, this means someone (the original developer, or another committer) will need to raise a PR after a few days to remove the annotation (assuming the test was stable). Eventually, I was hoping to automate or partially automate this aspect of the system. It seems simpler to write a script that modifies a plain text file compared with removing annotations in Java code. > we don't need to worry that "quarantined.txt" having out-of-date test names This could be a problem, yes. --- Maybe we can consider a two-tiered approach here: Isolation (manual) * Test is marked with tag annotation * This is a permanent state until the developers think the test is healthy again * These tests are run in a separate build step to not affect build outcomes, but gather data Quarantine (automated) * Tests leaving Isolation enter the Quarantine automatically * New integration tests enter the Quarantine automatically * Test stays in quarantine for a few days to evaluate * These tests are run in a separate build step to not affect build outcomes, but gather data * If all runs are passing, it leaves the quarantine I think with this approach, we can make the Quarantine fully data-driven and automated. Essentially, the build will query Develocity for flaky test results from the last N days and run those tests separately. WDYT? On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> wrote: > Hi David, > > Thank you for KIP. > > Could we include percentages for each flaky test in quarantined.txt? This > would help us prioritize which tests to resolve first. > > Additionally, I would prefer to add a flaky (JUnit) tag to the source code > so we can focus on these tests during development. > > Thanks, > TaiJuWu > > > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi <kiting...@gmail.com> wrote: > > > Hi David, > > > > Thanks for this great KIP. > > > > I really appreciate the goal of this KIP, which aims to stabilize the > build > > and improve our confidence in CI results. > > It addresses a real issue where we've become accustomed to seeing failed > > results from CI, and this is definitely not good for the Kafka community. > > > > I have a question regarding this KIP: > > It seems that we need to maintain the `quarantined.txt` files manually, > is > > that correct? > > I'm thinking this could become an issue, especially with the planned > > removal of ZK in 4.0, which will undoubtedly bring many changes to our > > codebase. > > Given that, maintaining the `quarantined.txt` files might become a pain. > > It would be nice if we could maintain it programmatically. > > > > Best Regards, > > TengYao > > > > Chia-Ping Tsai <chia7...@gmail.com> 於 2024年9月19日 週四 上午3:24寫道: > > > > > hi David > > > > > > The KIP is beautiful and I do love a rule which makes us handle those > > flaky > > > seriously. > > > > > > Regarding the "JUnit Tags", it can bring some benefits to us. > > > > > > 1. we can retry only the tests having "flaky" annotation. Other > non-flaky > > > tests should not be retryable > > > 2. we don't need to worry that "quarantined.txt" having out-of-date > test > > > names > > > 3. we can require the flaky annotation must have jira link. That means > > the > > > PR's author must create the jira link for the new flaky > > > > > > Also, we can add a gradle task to generate "quarantined.txt" file if > > needs. > > > > > > Best, > > > Chia-Ping > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 上午12:02寫道: > > > > > > > Hello, Kafka community! > > > > > > > > Looking at the last 7 days of GitHub, we have 59 out of 64 trunk > builds > > > > having flaky tests. Excluding timeouts (a separate issue), only 4 > > builds > > > > out of the last 7 days have failed due to excess test failures. This > is > > > > actually a slight improvement when compared with the last 28 days. > But > > > > still, this is obviously a bad situation to be in. > > > > > > > > We have previously discussed a few ideas to mitigate the impact that > > > flaky > > > > tests have on our builds. For PRs, we are actually seeing a lot of > > > > successful status checks due to our use of the Develocity test retry > > > > feature. However, the blanket use of "testRetry" is a bad practice in > > > > my opinion. It makes it far too easy for us to ignore tests that are > > only > > > > occasionally flaky. It also applies to unit tests which should never > be > > > > flaky. > > > > > > > > Another problem is that we are naturally introducing flaky tests as > new > > > > features (and tests) are introduced. Similar to feature development, > it > > > > takes some time for tests to mature and stabilize -- tests are code, > > > after > > > > all. > > > > > > > > I have written down a proposal for tracking and managing our flaky > > > tests. I > > > > have written this as a KIP even though this is an internal change. I > > did > > > so > > > > because I would like us to discuss, debate, and solidify a plan -- > and > > > > ultimately vote on it. A KIP seemed like a good fit. > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management > > > > > > > > I have back-tested this strategy (as best as I can) to our trunk > builds > > > > from the last month using data from Develocity (i.e., ge.apache.org > ). > > I > > > > looked at two scenarios. The first scenario was simply quarantining > > tests > > > > with higher than 1% flaky failures, no test re-runs were considered. > > The > > > > second scenario extends the first by allowing up to 3 total flaky > > > failures > > > > from non-quarantined tests (tests with less than 1% total flakiness). > > > > > > > > Total builds: *238* > > > > Flaky/Failed builds: *228* > > > > Flaky builds scenario 1 (quarantine only): *40* > > > > Flaky builds scenario 2 (quarantine + retry): *3* > > > > > > > > In other words, we can tackle the worst flaky failures with the > > > quarantine > > > > strategy as described in the KIP and handle the long tail of flaky > > > failures > > > > with the Develocity retry plugin. If we only had 3 failing trunk > builds > > > per > > > > month to investigate, I'd say we were in pretty good shape :) > > > > > > > > Let me know what you think! > > > > > > > > Cheers, > > > > David A > > > > > > > > > > -- David Arthur