Hi David,

Thank you for the KIP.

Could we include the flakiness percentage for each test in quarantined.txt?
This would help us prioritize which tests to resolve first.

Additionally, I would prefer to add a "flaky" JUnit tag in the source code
so we can focus on these tests during development.

Thanks,
TaiJuWu


On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi <kiting...@gmail.com> wrote:

> Hi David,
>
> Thanks for this great KIP.
>
> I really appreciate the goal of this KIP, which aims to stabilize the build
> and improve our confidence in CI results.
> It addresses a real issue: we've become accustomed to seeing failed CI
> results, and this is definitely not good for the Kafka community.
>
> I have a question regarding this KIP:
> It seems that we need to maintain the `quarantined.txt` files manually. Is
> that correct?
> I'm thinking this could become an issue, especially with the planned
> removal of ZK in 4.0, which will undoubtedly bring many changes to our
> codebase.
> Given that, maintaining the `quarantined.txt` files might become a pain,
> so it would be nice if we could maintain them programmatically.
>
> Best Regards,
> TengYao
>
> Chia-Ping Tsai <chia7...@gmail.com> 於 2024年9月19日 週四 上午3:24寫道:
>
> > hi David
> >
> > The KIP is beautiful, and I do love a rule that makes us handle those
> > flaky tests seriously.
> >
> > Regarding the "JUnit Tags": they can bring us several benefits.
> >
> > 1. We can retry only the tests that carry the "flaky" annotation; tests
> > without it should not be retryable.
> > 2. We don't need to worry about "quarantined.txt" containing out-of-date
> > test names.
> > 3. We can require that the "flaky" annotation include a JIRA link, which
> > means the PR author must create a JIRA ticket for any new flaky test.
> >
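To make the tag idea concrete, here is a minimal sketch. The `@Flaky` annotation name, its `jira` element, and the ticket id are all illustrative, not part of the KIP; in the real build the annotation would presumably be combined with JUnit 5's `@Tag` so the build can include or exclude tagged tests.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical marker annotation; in the actual build it could be
// meta-annotated with JUnit 5's @Tag("flaky") so Gradle can include
// or exclude tagged tests during a test run.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Flaky {
    String jira(); // no default value, so every use must name a JIRA ticket
}

public class FlakyTagSketch {
    @Flaky(jira = "KAFKA-00000") // placeholder ticket id
    void occasionallyFailingTest() { }

    public static void main(String[] args) throws Exception {
        Method m = FlakyTagSketch.class.getDeclaredMethod("occasionallyFailingTest");
        // Read the JIRA link back at runtime, e.g. for reporting.
        System.out.println(m.getAnnotation(Flaky.class).jira());
    }
}
```

Because `jira()` has no default, any `@Flaky` use without a ticket fails to compile, which would enforce point 3 above at the compiler level.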
> > Also, we can add a Gradle task to generate the "quarantined.txt" file if
> > needed.
> >
> > Best,
> > Chia-Ping
> >
> > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 上午12:02寫道:
> >
> > > Hello, Kafka community!
> > >
> > > Looking at the last 7 days on GitHub, 59 out of 64 trunk builds had
> > > flaky tests. Excluding timeouts (a separate issue), only 4 builds in
> > > the last 7 days failed due to excess test failures. This is actually a
> > > slight improvement compared with the last 28 days. But still, this is
> > > obviously a bad situation to be in.
> > >
> > > We have previously discussed a few ideas to mitigate the impact that
> > > flaky tests have on our builds. For PRs, we are actually seeing a lot
> > > of successful status checks thanks to our use of the Develocity test
> > > retry feature. However, the blanket use of "testRetry" is, in my
> > > opinion, a bad practice. It makes it far too easy for us to ignore
> > > tests that are only occasionally flaky. It also applies to unit tests,
> > > which should never be flaky.
> > >
> > > Another problem is that we naturally introduce flaky tests as new
> > > features (and their tests) are added. Much like feature development,
> > > it takes some time for tests to mature and stabilize -- tests are
> > > code, after all.
> > >
> > > I have written up a proposal for tracking and managing our flaky
> > > tests. Although this is an internal change, I have written it as a KIP
> > > because I would like us to discuss, debate, and solidify a plan -- and
> > > ultimately vote on it. A KIP seemed like a good fit.
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> > >
> > > I have back-tested this strategy (as best I can) against our trunk
> > > builds from the last month using data from Develocity (i.e.,
> > > ge.apache.org). I looked at two scenarios. The first was simply
> > > quarantining tests with more than 1% flaky failures, with no test
> > > re-runs considered. The second extends the first by allowing up to 3
> > > total flaky failures from non-quarantined tests (tests with less than
> > > 1% total flakiness).
> > >
> > > Total builds: *238*
> > > Flaky/Failed builds: *228*
> > > Flaky builds scenario 1 (quarantine only): *40*
> > > Flaky builds scenario 2 (quarantine + retry): *3*
> > >
> > > In other words, we can tackle the worst flaky failures with the
> > > quarantine strategy described in the KIP and handle the long tail of
> > > flaky failures with the Develocity retry plugin. If we only had 3
> > > failing trunk builds per month to investigate, I'd say we were in
> > > pretty good shape :)
> > >
> > > Let me know what you think!
> > >
> > > Cheers,
> > > David A
> > >
> >
>
