TengYao,

> These two mechanisms are independent. We could manually remove a tag from a test, but at the same time, it might still be quarantined. I know the above situation might sound weird, but I just want to understand how it would work.
If we remove a tag from a test, we are signaling that we *think* it is now stable. However, until we have data to prove that, we can run the test in the quarantine. This helps with the situation where there are multiple sources of flakiness and we are making iterative progress on improving it. The alternative is to keep the fix in a PR until the developer gets enough signal from the PR builds to be convinced that the flakiness is gone. I mention in the KIP that this approach has issues.

Chia-Ping,

> If so, could we query Gradle Develocity to get the flaky tests and then make only those flaky tests retryable in CI?

This can work for existing flaky tests. We can query Develocity to find flaky tests in the last N days and run only those tests with retry. This is what I meant by "data driven" for the quarantine. So, no need to track the "quarantined.txt" file separately -- we can derive this data at build time from Develocity.

However, this doesn't help with newly added tests that introduce flakiness. For this, we need some way to detect when a new test has been added. Still thinking through this...

-David

On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai <chia7...@gmail.com> wrote:

> hi David
>
> The two-tiered approach is interesting and I have questions similar to TengYao's.
>
> BUT, going back to the purpose of quarantine and isolation: it seems to me they are used to keep our CI from being noised by flaky tests, right? If so, could we query Gradle Develocity to get the flaky tests and then make only those flaky tests retryable in CI? That means a non-flaky (stable) test must pass without a retry.
>
> This approach is simpler (I guess), and it can isolate the flaky tests (by retry) from CI. If a flaky test can't pass after a retry, I feel it is a critical failed test rather than just a flaky one. On top of that, this approach does not need an extra file ("quarantined.txt") under git control and it does not need to change any code in the test classes.
>
> Best,
> Chia-Ping
>
> On Thu, Sep 19, 2024 at 11:19 PM TengYao Chi <kiting...@gmail.com> wrote:
>
> > Hi David,
> >
> > Thanks for the explanation.
> >
> > I like this two-tiered approach, which gives us more flexibility to handle flaky tests.
> >
> > The following is my understanding of how it works; please correct me if I'm wrong:
> > If we adopt the two-tiered approach, a test might have two states (Isolated by the developer, Quarantined automatically).
> > These two mechanisms are independent. We could manually remove a tag from a test, but at the same time, it might still be quarantined.
> > I know the above situation might sound weird, but I just want to understand how it would work.
> >
> > Best Regards,
> > TengYao
> >
> > On Thu, Sep 19, 2024 at 10:07 PM David Arthur <mum...@gmail.com> wrote:
> >
> > > Chia/TengYao/TaiJuWu, I agree that tags are a straightforward approach. In fact, my initial idea was to use tags as the isolation mechanism.
> > >
> > > Let me try to motivate the use of a text file a bit more.
> > >
> > > Consider the "new tests" scenario where a developer has added a new integration test. If we use annotations, this means someone (the original developer, or another committer) will need to raise a PR after a few days to remove the annotation (assuming the test was stable). Eventually, I was hoping to automate or partially automate this aspect of the system. It seems simpler to write a script that modifies a plain text file compared with removing annotations in Java code.
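As a rough illustration of the quoted point above: assuming a one-test-name-per-line format for quarantined.txt (an assumption for this sketch; the thread does not pin the format down here), the "script that modifies a plain text file" could be as small as the following, with all names hypothetical:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: drop a now-stable test from the plain-text quarantine list.
public final class RemoveFromQuarantine {
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RemoveFromQuarantine <fully-qualified test name>");
            System.exit(1);
        }
        Path file = Path.of("quarantined.txt");
        String testName = args[0]; // e.g. "org.apache.kafka.SomeIntegrationTest" (placeholder)

        // Keep every line except the test being promoted out of the quarantine.
        List<String> remaining = Files.readAllLines(file).stream()
                .filter(line -> !line.trim().equals(testName))
                .collect(Collectors.toList());

        Files.write(file, remaining);
    }
}

Automating this kind of edit is easier than programmatically removing an annotation from Java source, which is the trade-off being weighed above.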
> > > > we don't need to worry about "quarantined.txt" having out-of-date test names
> > >
> > > This could be a problem, yes.
> > >
> > > ---
> > >
> > > Maybe we can consider a two-tiered approach here:
> > >
> > > Isolation (manual)
> > > * Test is marked with a tag annotation
> > > * This is a permanent state until the developers think the test is healthy again
> > > * These tests are run in a separate build step so they do not affect build outcomes, but still gather data
> > >
> > > Quarantine (automated)
> > > * Tests leaving Isolation enter the Quarantine automatically
> > > * New integration tests enter the Quarantine automatically
> > > * A test stays in the Quarantine for a few days to be evaluated
> > > * These tests are run in a separate build step so they do not affect build outcomes, but still gather data
> > > * If all runs are passing, it leaves the Quarantine
> > >
> > > I think with this approach, we can make the Quarantine fully data-driven and automated. Essentially, the build will query Develocity for flaky test results from the last N days and run those tests separately.
> > >
> > > WDYT?
> > >
> > > On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thank you for the KIP.
> > > >
> > > > Could we include percentages for each flaky test in quarantined.txt? This would help us prioritize which tests to resolve first.
> > > >
> > > > Additionally, I would prefer to add a flaky (JUnit) tag to the source code so we can focus on these tests during development.
> > > >
> > > > Thanks,
> > > > TaiJuWu
> > > >
> > > > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi <kiting...@gmail.com> wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for this great KIP.
> > > > >
> > > > > I really appreciate the goal of this KIP, which aims to stabilize the build and improve our confidence in CI results. It addresses a real issue where we've become accustomed to seeing failed results from CI, and this is definitely not good for the Kafka community.
> > > > >
> > > > > I have a question regarding this KIP: it seems that we need to maintain the `quarantined.txt` file manually, is that correct? I'm thinking this could become an issue, especially with the planned removal of ZK in 4.0, which will undoubtedly bring many changes to our codebase. Given that, maintaining the `quarantined.txt` file might become a pain. It would be nice if we could maintain it programmatically.
> > > > >
> > > > > Best Regards,
> > > > > TengYao
> > > > >
> > > > > On Thu, Sep 19, 2024 at 3:24 AM Chia-Ping Tsai <chia7...@gmail.com> wrote:
> > > > >
> > > > > > hi David
> > > > > >
> > > > > > The KIP is beautiful and I do love a rule which makes us handle those flaky tests seriously.
> > > > > >
> > > > > > Regarding the "JUnit Tags", they can bring some benefits to us:
> > > > > >
> > > > > > 1. we can retry only the tests having the "flaky" annotation. Other non-flaky tests should not be retryable
> > > > > > 2. we don't need to worry about "quarantined.txt" having out-of-date test names
> > > > > > 3. we can require the flaky annotation to have a jira link. That means the PR's author must create the jira link for the new flaky test
> > > > > >
> > > > > > Also, we can add a gradle task to generate the "quarantined.txt" file if needed.
> > > > > >
> > > > > > Best,
> > > > > > Chia-Ping
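To make the tag idea above concrete, here is a minimal sketch of what such a composed JUnit 5 annotation could look like; making the JIRA ticket a required element is what would force PR authors to supply it. The Flaky name and the ticket value shown are placeholders for illustration, not something this thread or the KIP has settled on:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;

// Composed JUnit 5 annotation: anything marked @Flaky also carries the "flaky" tag,
// so the build can include or exclude these tests with a tag expression.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky")
public @interface Flaky {
    // Required element: the JIRA ticket tracking the flakiness, e.g. "KAFKA-12345" (placeholder).
    String ticket();
}

A test would then be marked with something like @Flaky(ticket = "KAFKA-12345"), and the build could run tagged and untagged tests in separate steps, applying retries only to the former.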
> > > > > > On Thu, Sep 19, 2024 at 12:02 AM David Arthur <mum...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello, Kafka community!
> > > > > > >
> > > > > > > Looking at the last 7 days on GitHub, we have 59 out of 64 trunk builds with flaky tests. Excluding timeouts (a separate issue), only 4 builds out of the last 7 days have failed due to excess test failures. This is actually a slight improvement when compared with the last 28 days. But still, this is obviously a bad situation to be in.
> > > > > > >
> > > > > > > We have previously discussed a few ideas to mitigate the impact that flaky tests have on our builds. For PRs, we are actually seeing a lot of successful status checks due to our use of the Develocity test retry feature. However, the blanket use of "testRetry" is a bad practice in my opinion. It makes it far too easy for us to ignore tests that are only occasionally flaky. It also applies to unit tests, which should never be flaky.
> > > > > > >
> > > > > > > Another problem is that we are naturally introducing flaky tests as new features (and tests) are introduced. Similar to feature development, it takes some time for tests to mature and stabilize -- tests are code, after all.
> > > > > > >
> > > > > > > I have written down a proposal for tracking and managing our flaky tests. I have written this as a KIP even though this is an internal change. I did so because I would like us to discuss, debate, and solidify a plan -- and ultimately vote on it. A KIP seemed like a good fit.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> > > > > > >
> > > > > > > I have back-tested this strategy (as best as I can) against our trunk builds from the last month using data from Develocity (i.e., ge.apache.org). I looked at two scenarios. The first scenario was simply quarantining tests with higher than 1% flaky failures; no test re-runs were considered. The second scenario extends the first by allowing up to 3 total flaky failures from non-quarantined tests (tests with less than 1% total flakiness).
> > > > > > >
> > > > > > > Total builds: *238*
> > > > > > > Flaky/Failed builds: *228*
> > > > > > > Flaky builds scenario 1 (quarantine only): *40*
> > > > > > > Flaky builds scenario 2 (quarantine + retry): *3*
> > > > > > >
> > > > > > > In other words, we can tackle the worst flaky failures with the quarantine strategy as described in the KIP and handle the long tail of flaky failures with the Develocity retry plugin. If we only had 3 failing trunk builds per month to investigate, I'd say we were in pretty good shape :)
> > > > > > >
> > > > > > > Let me know what you think!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > David A
> > >
> > > --
> > > David Arthur

--
David Arthur