hi José and David

Maybe unit tests should be excluded from the "Flaky Test Management" and "retry" mechanisms.
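For option 2 below, the retry filter could look roughly like this. This is only a sketch in Gradle Kotlin DSL (the real build is Groovy), it assumes the org.gradle.test-retry plugin's filter API, and the "allowRetryUnitTest" property and the @IntegrationTest marker annotation are placeholder names:

plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

// CI would only ever pass -PallowRetryUnitTest=true if we decide to lift the rule.
val allowRetryUnitTest = providers.gradleProperty("allowRetryUnitTest")
    .map { it.toBoolean() }
    .getOrElse(false)

tasks.test {
    retry {
        maxRetries.set(3)
        filter {
            // includeAnnotationClasses matches class-level annotations by name,
            // so this presumes integration tests carry a marker annotation such
            // as a hypothetical @IntegrationTest. Plain unit tests stay
            // non-retryable and must pass deterministically on the first run.
            if (!allowRetryUnitTest) {
                includeAnnotationClasses.add("*IntegrationTest*")
            }
        }
    }
}

So, concretely: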
1. unit test can NOT have "@Flaky" (if we decided to use annotation in the flaky management) 2. CI will enable the retry only for integration test. This can be addressed by adding a new flag "allowRetryUnitTest" and add filter to "retry plugin" the option 2 can be added to our CI easily. WDYT? Best, Chia-Ping David Arthur <mum...@gmail.com> 於 2024年9月21日 週六 上午10:27寫道: > José, > > By default, unit tests will not be eligible for the automatic new test > quarantine. Like you said, if a unit test fails, it indicates a problem. > > Perhaps we could auto-fail a test marked as "flaky" that is not also tagged > as "integration"? > > > > Chia-Ping, > > >IMHO, the rule should be "unit test should always be deterministic > (non-retryable)" > > I agree :) > > On Fri, Sep 20, 2024 at 9:17 PM Chia-Ping Tsai <chia7...@gmail.com> wrote: > > > > Can modules opt out of this feature? For example, the raft module > > doesn't have any integration tests and all of the tests are meant to > > be deterministic. It would be dangerous to the protocol's correctness > > and the consistency of the cluster metadata to allow contributors to > > mark tests as flaky in the raft module instead of investigating the > > source of the failure. > > > > IMHO, the rule should be "unit test should always be deterministic > > (non-retryable)" > > > > Best, > > Chia-Ping > > > > José Armando García Sancio <jsan...@confluent.io.invalid> 於 2024年9月21日 > 週六 > > 上午9:01寫道: > > > > > Thanks for the proposal David. > > > > > > Can modules opt out of this feature? For example, the raft module > > > doesn't have any integration tests and all of the tests are meant to > > > be deterministic. It would be dangerous to the protocol's correctness > > > and the consistency of the cluster metadata to allow contributors to > > > mark tests as flaky in the raft module instead of investigating the > > > source of the failure. > > > > > > On Fri, Sep 20, 2024 at 12:07 PM David Arthur <mum...@gmail.com> > wrote: > > > > > > > > Chia-Ping, I think something like that can work. I was also thinking > > > about > > > > extracting the test names during trunk builds using Gradle and > storing > > > that > > > > somewhere. I think it's fair to say we can derive this data from Git > > and > > > > Develocity. We can probably figure out the implementation details > later > > > on > > > > :) > > > > > > > > I've updated the KIP to describe the two-tiered approach as well as > > > > including a diagram. > > > > > > > > -David > > > > > > > > > > > > On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai <chia7...@gmail.com> > > > wrote: > > > > > > > > > > However, this doesn't help with the newly added tests that > > introduce > > > > > flakiness. For this, we need some way > > > > > to detect when a new test has been added. Still thinking through > > > this... > > > > > > > > > > Maybe we can use git + gradle develocity to address it. > > > > > > > > > > 1) list the files changed recently (for example: git diff > --name-only > > > "@{7 > > > > > days ago}") > > > > > 2) use the test-related filename to query Gradle Develocity with "7 > > > days > > > > > ago" > > > > > 3) the test cases having fewer builds are viewed as "new tests" > > > > > > > > > > WDYT? > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 下午11:56寫道: > > > > > > > > > > > TengYao, > > > > > > > > > > > > > These two mechanisms are independent. We could manually remove > a > > > tag > > > > > from > > > > > > a > > > > > > test, but at the same time, it might still be quarantined. 
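On the git + Develocity idea for detecting new tests (quoted above), here is a rough Kotlin sketch of the git half. The Develocity lookup is left abstract because the thread does not pin down an API, and the 7-day window, filename filter, and build-count cutoff are only guesses:

import java.util.concurrent.TimeUnit

// Step 1: test-related files touched recently, per `git diff --name-only "@{7 days ago}"`.
fun recentlyChangedTestFiles(days: Int = 7): List<String> {
    val proc = ProcessBuilder("git", "diff", "--name-only", "@{$days days ago}")
        .redirectErrorStream(true)
        .start()
    val output = proc.inputStream.bufferedReader().readLines()
    proc.waitFor(60, TimeUnit.SECONDS)
    return output.filter { it.endsWith("Test.java") || it.endsWith("Test.scala") }
}

// Step 2 (placeholder): how many builds in the same window ran this test,
// according to Develocity. Deliberately left unimplemented.
fun buildCountInDevelocity(testClass: String, days: Int = 7): Int = TODO()

// Step 3: a test seen in only a handful of builds is treated as "new".
fun looksNew(testClass: String, minBuilds: Int = 5): Boolean =
    buildCountInDevelocity(testClass) < minBuilds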
> > > > > > I know the above situation might sound weird, but I just want to > > > > > understand > > > > > > how it would work. > > > > > > > > > > > > If we remove a tag from a test, we are signaling that we *think* > it > > > is > > > > > now > > > > > > stable. However, until we have data > > > > > > to prove that, we can run the test in the quarantine. This helps > > > with the > > > > > > situation where there are multiple > > > > > > sources of flakiness and we are making iterative progress on > > > improving > > > > > it. > > > > > > > > > > > > The alternative is to keep the fix in a PR until the developer > gets > > > > > enough > > > > > > signal from the PR builds to be > > > > > > convinced that the flakiness is gone. I mention in the KIP that > > this > > > > > > approach has issues. > > > > > > > > > > > > > > > > > > Chia-Ping, > > > > > > > > > > > > > If so, could we query the Gradle develocity to get flaky and > then > > > make > > > > > > only those flaky > > > > > > retryable in CI? > > > > > > > > > > > > This can work for existing flaky tests. We can query Develocity > to > > > find > > > > > > flaky tests in the last N days > > > > > > and run only those tests with retry. This is what I meant by > "data > > > > > driven" > > > > > > for the quarantine. So, no need to track > > > > > > "quarantined.txt" file separately -- we can derive this data at > > build > > > > > time > > > > > > from Develocity. > > > > > > > > > > > > However, this doesn't help with the newly added tests that > > introduce > > > > > > flakiness. For this, we need some way > > > > > > to detect when a new test has been added. Still thinking through > > > this... > > > > > > > > > > > > -David > > > > > > > > > > > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai < > > chia7...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > hi David > > > > > > > > > > > > > > The two-tiered approach is interesting and I have questions > > > similar to > > > > > > > TengYao. > > > > > > > > > > > > > > BUT, go back to the usage of quarantine and isolation. It seems > > to > > > me > > > > > > they > > > > > > > are used to make our CI not be noised by the flaky, right? If > so, > > > could > > > > > > we > > > > > > > query the Gradle develocity to get flaky and then make only > those > > > flaky > > > > > > > retryable in CI? That means a non-flaky (stable) test must pass > > > without > > > > > > > retry. > > > > > > > > > > > > > > This approach is more simple (I guess), and it can isolate the > > > flaky > > > > > (by > > > > > > > retry) from CI. If a flaky can't pass after retry, I feel it > is a > > > > > > critical > > > > > > > failed test rather than just a flaky. On top of that, this > > approach > > > > > does > > > > > > > not need an extra file ("quarantined.txt") in git control and > it > > > does > > > > > not > > > > > > > need to change any code of test classes. > > > > > > > > > > > > > > Best, > > > > > > > Chia-Ping > > > > > > > > > > > > > > TengYao Chi <kiting...@gmail.com> 於 2024年9月19日 週四 下午11:19寫道: > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > Thanks for the explanation. > > > > > > > > > > > > > > > > I like this two-tiered approach, which gives us more > > flexibility > > > to > > > > > > > handle > > > > > > > > flaky tests. 
> > > > > > > > > > > > > > > > The following is my understanding of how it works; please > > > correct me > > > > > if > > > > > > > I'm > > > > > > > > wrong: > > > > > > > > If we adopt the two-tiered approach, the test might have two > > > > > > > > states.(Isolated by developer, Quarantined automatically). > > > > > > > > These two mechanisms are independent. We could manually > remove > > a > > > tag > > > > > > > from a > > > > > > > > test, but at the same time, it might still be quarantined. > > > > > > > > I know the above situation might sound weird, but I just want > > to > > > > > > > understand > > > > > > > > how it would work. > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > TengYao > > > > > > > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 下午10:07寫道: > > > > > > > > > > > > > > > > > Chia/TengYao/TaiJuWu, I agree that tags are a > straightforward > > > > > > approach. > > > > > > > > In > > > > > > > > > fact, my initial idea was to use tags as the isolation > > > mechanism. > > > > > > > > > > > > > > > > > > Let me try to motivate the use of a text file a bit more. > > > > > > > > > > > > > > > > > > Consider the "new tests" scenario where a developer has > added > > > a new > > > > > > > > > integration test. If we use annotations, this means someone > > > (the > > > > > > > original > > > > > > > > > developer, or another committer) will need to raise a PR > > after > > > a > > > > > few > > > > > > > days > > > > > > > > > to remove the annotation (assuming the test was stable). > > > > > Eventually, > > > > > > I > > > > > > > > was > > > > > > > > > hoping to automate or partially automate this aspect of the > > > system. > > > > > > It > > > > > > > > > seems simpler to write a script that modifies a plain text > > file > > > > > > > compared > > > > > > > > > with removing annotations in Java code. > > > > > > > > > > > > > > > > > > > we don't need to worry that "quarantined.txt" having > > > out-of-date > > > > > > test > > > > > > > > > names > > > > > > > > > > > > > > > > > > This could be a problem, yes. > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > Maybe we can consider a two-tiered approach here: > > > > > > > > > > > > > > > > > > Isolation (manual) > > > > > > > > > * Test is marked with tag annotation > > > > > > > > > * This is a permanent state until the developers think the > > > test is > > > > > > > > healthy > > > > > > > > > again > > > > > > > > > * These tests are run in a separate build step to not > affect > > > build > > > > > > > > > outcomes, but gather data > > > > > > > > > > > > > > > > > > > > > > > > > > > Quarantine (automated) > > > > > > > > > * Tests leaving Isolation enter the Quarantine > automatically > > > > > > > > > * New integration tests enter the Quarantine automatically > > > > > > > > > * Test stays in quarantine for a few days to evaluate > > > > > > > > > * These tests are run in a separate build step to not > affect > > > build > > > > > > > > > outcomes, but gather data > > > > > > > > > * If all runs are passing, it leaves the quarantine > > > > > > > > > > > > > > > > > > I think with this approach, we can make the Quarantine > fully > > > > > > > data-driven > > > > > > > > > and automated. Essentially, the build will query Develocity > > for > > > > > flaky > > > > > > > > test > > > > > > > > > results from the last N days and run those tests > separately. > > > > > > > > > > > > > > > > > > > > > > > > > > > WDYT? 
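For the "separate build step" in both tiers, a second Test task keyed off a JUnit tag might be enough. A Gradle Kotlin DSL sketch (the task name and the "flaky" tag are placeholders; the real wiring would live in the existing Groovy build):

tasks.register<Test>("quarantinedTest") {
    // A manually registered Test task does not inherit the test source set,
    // so wire the classes and classpath up explicitly.
    testClassesDirs = sourceSets["test"].output.classesDirs
    classpath = sourceSets["test"].runtimeClasspath
    useJUnitPlatform {
        includeTags("flaky")
    }
    // Gather data without affecting the build outcome.
    ignoreFailures = true
}

tasks.test {
    useJUnitPlatform {
        excludeTags("flaky")
    }
}

CI would then run quarantinedTest as its own step, so its results are reported but never gate the build.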
> > > > > > > > > > > > > > > > > > On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> > > > wrote: > > > > > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > > > > > Thank you for KIP. > > > > > > > > > > > > > > > > > > > > Could we include percentages for each flaky test in > > > > > > quarantined.txt? > > > > > > > > This > > > > > > > > > > would help us prioritize which tests to resolve first. > > > > > > > > > > > > > > > > > > > > Additionally, I would prefer to add a flaky (JUnit) tag > to > > > the > > > > > > source > > > > > > > > > code > > > > > > > > > > so we can focus on these tests during development. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > TaiJuWu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi < > > > > > kiting...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > > > > > > > Thanks for this great KIP. > > > > > > > > > > > > > > > > > > > > > > I really appreciate the goal of this KIP, which aims to > > > > > stabilize > > > > > > > the > > > > > > > > > > build > > > > > > > > > > > and improve our confidence in CI results. > > > > > > > > > > > It addresses a real issue where we've become accustomed > > to > > > > > seeing > > > > > > > > > failed > > > > > > > > > > > results from CI, and this is definitely not good for > the > > > Kafka > > > > > > > > > community. > > > > > > > > > > > > > > > > > > > > > > I have a question regarding this KIP: > > > > > > > > > > > It seems that we need to maintain the `quarantined.txt` > > > files > > > > > > > > manually, > > > > > > > > > > is > > > > > > > > > > > that correct? > > > > > > > > > > > I'm thinking this could become an issue, especially > with > > > the > > > > > > > planned > > > > > > > > > > > removal of ZK in 4.0, which will undoubtedly bring many > > > changes > > > > > > to > > > > > > > > our > > > > > > > > > > > codebase. > > > > > > > > > > > Given that, maintaining the `quarantined.txt` files > might > > > > > become > > > > > > a > > > > > > > > > pain. > > > > > > > > > > > It would be nice if we could maintain it > > programmatically. > > > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > > > TengYao > > > > > > > > > > > > > > > > > > > > > > Chia-Ping Tsai <chia7...@gmail.com> 於 2024年9月19日 週四 > > > 上午3:24寫道: > > > > > > > > > > > > > > > > > > > > > > > hi David > > > > > > > > > > > > > > > > > > > > > > > > The KIP is beautiful and I do love a rule which makes > > us > > > > > handle > > > > > > > > those > > > > > > > > > > > flaky > > > > > > > > > > > > seriously. > > > > > > > > > > > > > > > > > > > > > > > > Regarding the "JUnit Tags", it can bring some > benefits > > > to us. > > > > > > > > > > > > > > > > > > > > > > > > 1. we can retry only the tests having "flaky" > > annotation. > > > > > Other > > > > > > > > > > non-flaky > > > > > > > > > > > > tests should not be retryable > > > > > > > > > > > > 2. we don't need to worry that "quarantined.txt" > having > > > > > > > out-of-date > > > > > > > > > > test > > > > > > > > > > > > names > > > > > > > > > > > > 3. we can require the flaky annotation must have jira > > > link. 
> > > > > > That > > > > > > > > > means > > > > > > > > > > > the > > > > > > > > > > > > PR's author must create the jira link for the new > flaky > > > > > > > > > > > > > > > > > > > > > > > > Also, we can add a gradle task to generate > > > "quarantined.txt" > > > > > > file > > > > > > > > if > > > > > > > > > > > needs. > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Chia-Ping > > > > > > > > > > > > > > > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 > > > 上午12:02寫道: > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Kafka community! > > > > > > > > > > > > > > > > > > > > > > > > > > Looking at the last 7 days of GitHub, we have 59 > out > > > of 64 > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > > having flaky tests. Excluding timeouts (a separate > > > issue), > > > > > > > only 4 > > > > > > > > > > > builds > > > > > > > > > > > > > out of the last 7 days have failed due to excess > test > > > > > > failures. > > > > > > > > > This > > > > > > > > > > is > > > > > > > > > > > > > actually a slight improvement when compared with > the > > > last > > > > > 28 > > > > > > > > days. > > > > > > > > > > But > > > > > > > > > > > > > still, this is obviously a bad situation to be in. > > > > > > > > > > > > > > > > > > > > > > > > > > We have previously discussed a few ideas to > mitigate > > > the > > > > > > impact > > > > > > > > > that > > > > > > > > > > > > flaky > > > > > > > > > > > > > tests have on our builds. For PRs, we are actually > > > seeing a > > > > > > lot > > > > > > > > of > > > > > > > > > > > > > successful status checks due to our use of the > > > Develocity > > > > > > test > > > > > > > > > retry > > > > > > > > > > > > > feature. However, the blanket use of "testRetry" > is a > > > bad > > > > > > > > practice > > > > > > > > > in > > > > > > > > > > > > > my opinion. It makes it far too easy for us to > ignore > > > tests > > > > > > > that > > > > > > > > > are > > > > > > > > > > > only > > > > > > > > > > > > > occasionally flaky. It also applies to unit tests > > which > > > > > > should > > > > > > > > > never > > > > > > > > > > be > > > > > > > > > > > > > flaky. > > > > > > > > > > > > > > > > > > > > > > > > > > Another problem is that we are naturally > introducing > > > flaky > > > > > > > tests > > > > > > > > as > > > > > > > > > > new > > > > > > > > > > > > > features (and tests) are introduced. Similar to > > feature > > > > > > > > > development, > > > > > > > > > > it > > > > > > > > > > > > > takes some time for tests to mature and stabilize > -- > > > tests > > > > > > are > > > > > > > > > code, > > > > > > > > > > > > after > > > > > > > > > > > > > all. > > > > > > > > > > > > > > > > > > > > > > > > > > I have written down a proposal for tracking and > > > managing > > > > > our > > > > > > > > flaky > > > > > > > > > > > > tests. I > > > > > > > > > > > > > have written this as a KIP even though this is an > > > internal > > > > > > > > change. > > > > > > > > > I > > > > > > > > > > > did > > > > > > > > > > > > so > > > > > > > > > > > > > because I would like us to discuss, debate, and > > > solidify a > > > > > > plan > > > > > > > > -- > > > > > > > > > > and > > > > > > > > > > > > > ultimately vote on it. A KIP seemed like a good > fit. 
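On the earlier point that a flaky annotation must carry a Jira link, the annotation itself can enforce that by making the element mandatory. A small Kotlin sketch (the real annotation would presumably be a Java one in the test utilities, and the names are only illustrative):

import org.junit.jupiter.api.Tag

// JUnit 5 @Tag works as a meta-annotation, so @Flaky both tags the test as
// "flaky" and forces the author to record a tracking ticket: `jira` has no
// default value, so @Flaky without it does not compile.
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
@Retention(AnnotationRetention.RUNTIME)
@Tag("flaky")
annotation class Flaky(val jira: String)

// Usage: @Flaky(jira = "KAFKA-XXXXX") on a test class or method.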
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management > > > > > > > > > > > > > > > > > > > > > > > > > > I have back-tested this strategy (as best as I can) > > to > > > our > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > > from the last month using data from Develocity > (i.e., > > > > > > > > > ge.apache.org > > > > > > > > > > ). > > > > > > > > > > > I > > > > > > > > > > > > > looked at two scenarios. The first scenario was > > simply > > > > > > > > quarantining > > > > > > > > > > > tests > > > > > > > > > > > > > with higher than 1% flaky failures, no test re-runs > > > were > > > > > > > > > considered. > > > > > > > > > > > The > > > > > > > > > > > > > second scenario extends the first by allowing up > to 3 > > > total > > > > > > > flaky > > > > > > > > > > > > failures > > > > > > > > > > > > > from non-quarantined tests (tests with less than 1% > > > total > > > > > > > > > flakiness). > > > > > > > > > > > > > > > > > > > > > > > > > > Total builds: *238* > > > > > > > > > > > > > Flaky/Failed builds: *228* > > > > > > > > > > > > > Flaky builds scenario 1 (quarantine only): *40* > > > > > > > > > > > > > Flaky builds scenario 2 (quarantine + retry): *3* > > > > > > > > > > > > > > > > > > > > > > > > > > In other words, we can tackle the worst flaky > > failures > > > with > > > > > > the > > > > > > > > > > > > quarantine > > > > > > > > > > > > > strategy as described in the KIP and handle the > long > > > tail > > > > > of > > > > > > > > flaky > > > > > > > > > > > > failures > > > > > > > > > > > > > with the Develocity retry plugin. If we only had 3 > > > failing > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > per > > > > > > > > > > > > > month to investigate, I'd say we were in pretty > good > > > shape > > > > > :) > > > > > > > > > > > > > > > > > > > > > > > > > > Let me know what you think! > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > David A > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > David Arthur > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > David Arthur > > > > > > > > > > > > > > > > > > > > > > > -- > > > > David Arthur > > > > > > > > > > > > -- > > > -José > > > > > > > > -- > David Arthur >
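To make the two back-tested scenarios concrete, a toy Kotlin sketch of the rules as described above: scenario 1 quarantines anything above a 1% flaky rate, and scenario 2 additionally tolerates up to 3 flaky failures per build from non-quarantined tests. The data types and the way the window is fed in are only illustrative:

// Toy model; Develocity is the assumed data source and TestStats is made up.
data class TestStats(val name: String, val runs: Int, val flakyFailures: Int) {
    val flakyRate: Double
        get() = if (runs == 0) 0.0 else flakyFailures.toDouble() / runs
}

// Scenario 1: quarantine every test whose flaky rate over the window exceeds 1%.
fun quarantined(stats: List<TestStats>, threshold: Double = 0.01): Set<String> =
    stats.filter { it.flakyRate > threshold }.map { it.name }.toSet()

// Scenario 2: additionally allow up to 3 flaky failures per build from
// non-quarantined tests before treating the build as failed.
fun buildIsGreen(flakyFailuresOutsideQuarantine: Int, budget: Int = 3): Boolean =
    flakyFailuresOutsideQuarantine <= budget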