Hi David,

Thanks for the explanation.
I like this two-tiered approach; it gives us more flexibility to handle
flaky tests. The following is my understanding of how it works; please
correct me if I'm wrong: if we adopt the two-tiered approach, a test can
be in either of two states (Isolated by a developer, or Quarantined
automatically). The two mechanisms are independent, so we could manually
remove the tag from a test while it is still quarantined. I know that
situation might sound odd, but I just want to understand how it would work.

Best Regards,
TengYao

David Arthur <mum...@gmail.com> wrote on Thu, Sep 19, 2024 at 10:07 PM:

> Chia/TengYao/TaiJuWu, I agree that tags are a straightforward approach.
> In fact, my initial idea was to use tags as the isolation mechanism.
>
> Let me try to motivate the use of a text file a bit more.
>
> Consider the "new tests" scenario where a developer has added a new
> integration test. If we use annotations, someone (the original developer,
> or another committer) will need to raise a PR after a few days to remove
> the annotation (assuming the test was stable). Eventually, I was hoping
> to automate or partially automate this aspect of the system, and it seems
> simpler to write a script that modifies a plain text file than one that
> removes annotations from Java code.
>
> > we don't need to worry about "quarantined.txt" having out-of-date test
> > names
>
> This could be a problem, yes.
>
> ---
>
> Maybe we can consider a two-tiered approach here:
>
> Isolation (manual)
> * The test is marked with a tag annotation
> * This is a permanent state until the developers think the test is
>   healthy again
> * These tests are run in a separate build step so they do not affect
>   build outcomes, but still gather data
>
> Quarantine (automated)
> * Tests leaving Isolation enter the Quarantine automatically
> * New integration tests enter the Quarantine automatically
> * A test stays in the Quarantine for a few days to be evaluated
> * These tests are run in a separate build step so they do not affect
>   build outcomes, but still gather data
> * If all runs pass, the test leaves the Quarantine
>
> I think with this approach we can make the Quarantine fully data-driven
> and automated. Essentially, the build will query Develocity for flaky
> test results from the last N days and run those tests separately.
>
> WDYT?
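For concreteness, the Isolation tag described above could be a JUnit 5 composed tag annotation along these lines. This is only a sketch: the annotation name, its package, and the required ticket attribute (anticipating Chia-Ping's JIRA-link suggestion further down the thread) are placeholder assumptions, not something the KIP defines.

```java
// Sketch only -- name, package, and attribute are illustrative placeholders.
package org.apache.kafka.test;

import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;

@Documented
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky") // JUnit 5 applies this tag to any test carrying the annotation
public @interface Flaky {
    /** Tracking ticket, e.g. "KAFKA-12345", so every isolated test is accounted for. */
    String value();
}
```

A developer would then isolate a test with `@Flaky("KAFKA-12345")`, and removing the annotation is the manual step that hands the test over to the automated Quarantine tier.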
> On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> wrote:
>
> > Hi David,
> >
> > Thank you for the KIP.
> >
> > Could we include percentages for each flaky test in quarantined.txt?
> > This would help us prioritize which tests to resolve first.
> >
> > Additionally, I would prefer to add a flaky (JUnit) tag to the source
> > code so we can focus on these tests during development.
> >
> > Thanks,
> > TaiJuWu
> >
> > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi <kiting...@gmail.com> wrote:
> >
> > > Hi David,
> > >
> > > Thanks for this great KIP.
> > >
> > > I really appreciate the goal of this KIP, which aims to stabilize the
> > > build and improve our confidence in CI results. It addresses a real
> > > issue: we have become accustomed to seeing failed results from CI,
> > > and that is definitely not good for the Kafka community.
> > >
> > > I have a question regarding this KIP: it seems that we need to
> > > maintain the `quarantined.txt` file manually, is that correct? I'm
> > > thinking this could become an issue, especially with the planned
> > > removal of ZK in 4.0, which will undoubtedly bring many changes to
> > > our codebase. Given that, maintaining the `quarantined.txt` file
> > > might become a pain. It would be nice if we could maintain it
> > > programmatically.
> > >
> > > Best Regards,
> > > TengYao
> > >
> > > Chia-Ping Tsai <chia7...@gmail.com> wrote on Thu, Sep 19, 2024 at 3:24 AM:
> > >
> > > > hi David
> > > >
> > > > The KIP is beautiful, and I do love a rule that makes us handle
> > > > those flaky tests seriously.
> > > >
> > > > Regarding the "JUnit Tags", they can bring some benefits:
> > > >
> > > > 1. We can retry only the tests that have the "flaky" annotation.
> > > >    Other, non-flaky tests should not be retryable.
> > > > 2. We don't need to worry about "quarantined.txt" having
> > > >    out-of-date test names.
> > > > 3. We can require that the flaky annotation include a JIRA link.
> > > >    That means the PR's author must create the JIRA ticket for the
> > > >    new flaky test.
> > > >
> > > > Also, we can add a Gradle task to generate the "quarantined.txt"
> > > > file if needed.
> > > >
> > > > Best,
> > > > Chia-Ping
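Both the "separate build step" above and "retry only the tagged tests" boil down to selecting tests by tag. As an illustration of the mechanics, here is a minimal runner using the standard JUnit Platform Launcher API; the root package and class name are assumptions, and in practice this selection would more likely live in the Gradle test configuration.

```java
// Sketch only: a separate "quarantine" build step that runs just the tagged
// tests. The root package and class name are placeholders.
import static org.junit.platform.engine.discovery.DiscoverySelectors.selectPackage;

import java.io.PrintWriter;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.TagFilter;
import org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;

public class QuarantineRunner {
    public static void main(String[] args) {
        // Discover every test under the root package that carries the "flaky" tag.
        LauncherDiscoveryRequest request = LauncherDiscoveryRequestBuilder.request()
                .selectors(selectPackage("org.apache.kafka"))
                .filters(TagFilter.includeTags("flaky"))
                .build();

        SummaryGeneratingListener listener = new SummaryGeneratingListener();
        Launcher launcher = LauncherFactory.create();
        launcher.execute(request, listener);

        // Report failures without failing the build; the results still feed
        // the health data used to decide when a test leaves the Quarantine.
        PrintWriter out = new PrintWriter(System.out);
        listener.getSummary().printFailuresTo(out);
        out.flush();
    }
}
```

The main build step would do the inverse with `TagFilter.excludeTags("flaky")`, which is what keeps isolated and quarantined tests from affecting build outcomes.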
> > > > David Arthur <mum...@gmail.com> wrote on Thu, Sep 19, 2024 at 12:02 AM:
> > > >
> > > > > Hello, Kafka community!
> > > > >
> > > > > Looking at the last 7 days of GitHub builds, 59 out of 64 trunk
> > > > > builds had flaky tests. Excluding timeouts (a separate issue),
> > > > > only 4 builds in the last 7 days failed due to excess test
> > > > > failures. This is actually a slight improvement compared with the
> > > > > last 28 days, but it is still obviously a bad situation to be in.
> > > > >
> > > > > We have previously discussed a few ideas to mitigate the impact
> > > > > that flaky tests have on our builds. For PRs, we are actually
> > > > > seeing a lot of successful status checks thanks to our use of the
> > > > > Develocity test retry feature. However, the blanket use of
> > > > > "testRetry" is a bad practice in my opinion. It makes it far too
> > > > > easy for us to ignore tests that are only occasionally flaky, and
> > > > > it also applies to unit tests, which should never be flaky.
> > > > >
> > > > > Another problem is that we naturally introduce flaky tests as new
> > > > > features (and tests) are added. Similar to feature development,
> > > > > it takes some time for tests to mature and stabilize -- tests are
> > > > > code, after all.
> > > > >
> > > > > I have written down a proposal for tracking and managing our
> > > > > flaky tests. I have written it as a KIP even though this is an
> > > > > internal change, because I would like us to discuss, debate, and
> > > > > solidify a plan -- and ultimately vote on it. A KIP seemed like a
> > > > > good fit.
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> > > > >
> > > > > I have back-tested this strategy (as best as I can) against our
> > > > > trunk builds from the last month using data from Develocity
> > > > > (i.e., ge.apache.org). I looked at two scenarios. The first
> > > > > scenario simply quarantined tests with higher than 1% flaky
> > > > > failures; no test re-runs were considered. The second scenario
> > > > > extends the first by allowing up to 3 total flaky failures from
> > > > > non-quarantined tests (tests with less than 1% total flakiness).
> > > > >
> > > > > Total builds: *238*
> > > > > Flaky/Failed builds: *228*
> > > > > Flaky builds, scenario 1 (quarantine only): *40*
> > > > > Flaky builds, scenario 2 (quarantine + retry): *3*
> > > > >
> > > > > In other words, we can tackle the worst flaky failures with the
> > > > > quarantine strategy described in the KIP and handle the long tail
> > > > > of flaky failures with the Develocity retry plugin. If we only
> > > > > had 3 failing trunk builds per month to investigate, I'd say we
> > > > > were in pretty good shape :)
> > > > >
> > > > > Let me know what you think!
> > > > >
> > > > > Cheers,
> > > > > David A
>
> --
> David Arthur
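To make the back-tested quarantine rule concrete, here is a minimal sketch of the selection policy, assuming Develocity supplies a per-test flaky failure rate over the last N days (the class and method names below are illustrative, not from the KIP).

```java
// Sketch only: the "higher than 1% flaky failures" rule from the back-test.
// The flakyRates map (test name -> flaky failure rate over the last N days)
// is assumed to come from Develocity; fetching it is out of scope here.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QuarantinePolicy {
    private static final double FLAKY_THRESHOLD = 0.01; // 1%

    /** Tests strictly above the threshold run in the separate quarantine step. */
    public static List<String> quarantined(Map<String, Double> flakyRates) {
        return flakyRates.entrySet().stream()
                .filter(e -> e.getValue() > FLAKY_THRESHOLD)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

For example, `quarantined(Map.of("FooTest.testBar", 0.05, "FooTest.testBaz", 0.002))` would quarantine only `FooTest.testBar`; scenario 2 of the back-test additionally tolerates up to 3 flaky failures per build from the tests this policy leaves in the main suite.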