> Can modules opt out of this feature? For example, the raft module doesn't have any integration tests and all of the tests are meant to be deterministic. It would be dangerous to the protocol's correctness and the consistency of the cluster metadata to allow contributors to mark tests as flaky in the raft module instead of investigating the source of the failure.
IMHO, the rule should be "a unit test should always be deterministic (non-retryable)".

Best,
Chia-Ping

On Sat, Sep 21, 2024 at 9:01 AM, José Armando García Sancio <jsan...@confluent.io.invalid> wrote:

Thanks for the proposal David.

Can modules opt out of this feature? For example, the raft module doesn't have any integration tests and all of the tests are meant to be deterministic. It would be dangerous to the protocol's correctness and the consistency of the cluster metadata to allow contributors to mark tests as flaky in the raft module instead of investigating the source of the failure.

On Fri, Sep 20, 2024 at 12:07 PM David Arthur <mum...@gmail.com> wrote:

Chia-Ping, I think something like that can work. I was also thinking about extracting the test names during trunk builds using Gradle and storing them somewhere. I think it's fair to say we can derive this data from Git and Develocity. We can probably figure out the implementation details later on :)

I've updated the KIP to describe the two-tiered approach and included a diagram.

-David

On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai <chia7...@gmail.com> wrote:

> However, this doesn't help with the newly added tests that introduce
> flakiness. For this, we need some way to detect when a new test has
> been added. Still thinking through this...

Maybe we can use git + Gradle Develocity to address it:

1) List the files changed recently (for example: git diff --name-only "@{7 days ago}")
2) Use the test-related filenames to query Gradle Develocity over the same "7 days ago" window
3) The test cases having fewer builds are viewed as "new tests"

WDYT?

On Thu, Sep 19, 2024 at 11:56 PM, David Arthur <mum...@gmail.com> wrote:

TengYao,

> These two mechanisms are independent. We could manually remove a tag from a
> test, but at the same time, it might still be quarantined.
> I know the above situation might sound weird, but I just want to understand
> how it would work.

If we remove a tag from a test, we are signaling that we *think* it is now stable. However, until we have data to prove that, we can keep running the test in the quarantine. This helps with the situation where there are multiple sources of flakiness and we are making iterative progress on improving it.

The alternative is to keep the fix in a PR until the developer gets enough signal from the PR builds to be convinced that the flakiness is gone. I mention in the KIP that this approach has issues.

Chia-Ping,

> If so, could we query the Gradle Develocity to get the flaky tests and then
> make only those flaky tests retryable in CI?

This can work for existing flaky tests. We can query Develocity to find flaky tests in the last N days and run only those tests with retry. This is what I meant by "data driven" for the quarantine. So, no need to track a "quarantined.txt" file separately -- we can derive this data at build time from Develocity.

However, this doesn't help with the newly added tests that introduce flakiness. For this, we need some way to detect when a new test has been added. Still thinking through this...

-David
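As a rough illustration of the three steps above, something like the following could work. The git command is the one mentioned in the thread; queryDevelocityBuildCount() is only a placeholder, since the concrete ge.apache.org API call to use has not been settled here, and the filename filter and threshold are assumptions.

// Sketch only: detect "new tests" by combining recent git changes with
// Develocity build counts. queryDevelocityBuildCount() is a placeholder --
// the real ge.apache.org API call is not decided in this thread.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class NewTestDetector {
    public static void main(String[] args) throws Exception {
        // Step 1: files changed in the last 7 days
        Process git = new ProcessBuilder("git", "diff", "--name-only", "@{7 days ago}").start();
        List<String> changedTestFiles = new ArrayList<>();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(git.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                if (line.endsWith("Test.java")) {       // keep only test sources
                    changedTestFiles.add(line);
                }
            }
        }
        git.waitFor();

        // Steps 2-3: ask Develocity how many builds have run each test over the
        // same window; tests with very few recorded builds are treated as new.
        for (String testFile : changedTestFiles) {
            int builds = queryDevelocityBuildCount(testFile, 7);  // hypothetical helper
            if (builds < 5) {                                     // arbitrary threshold
                System.out.println("Likely new test: " + testFile);
            }
        }
    }

    // Placeholder for a query against the Develocity (ge.apache.org) data.
    private static int queryDevelocityBuildCount(String testFile, int lastNDays) {
        return 0;
    }
}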
On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai <chia7...@gmail.com> wrote:

hi David

The two-tiered approach is interesting and I have questions similar to TengYao's.

BUT, going back to the usage of quarantine and isolation: it seems to me they are used to keep our CI from being disturbed by flaky tests, right? If so, could we query Gradle Develocity to get the flaky tests and then make only those flaky tests retryable in CI? That means a non-flaky (stable) test must pass without retry.

This approach is simpler (I guess), and it can still isolate the flaky tests (by retry) from CI. If a flaky test can't pass even after retry, I feel it is a critical failed test rather than just a flaky one. On top of that, this approach does not need an extra file ("quarantined.txt") under git control and it does not need to change any code of the test classes.

Best,
Chia-Ping

On Thu, Sep 19, 2024 at 11:19 PM, TengYao Chi <kiting...@gmail.com> wrote:

Hi David,

Thanks for the explanation.

I like this two-tiered approach, which gives us more flexibility to handle flaky tests.

The following is my understanding of how it works; please correct me if I'm wrong:
If we adopt the two-tiered approach, a test might be in two states (Isolated by a developer, Quarantined automatically).
These two mechanisms are independent. We could manually remove a tag from a test, but at the same time, it might still be quarantined.
I know the above situation might sound weird, but I just want to understand how it would work.

Best Regards,
TengYao

On Thu, Sep 19, 2024 at 10:07 PM, David Arthur <mum...@gmail.com> wrote:

Chia/TengYao/TaiJuWu, I agree that tags are a straightforward approach. In fact, my initial idea was to use tags as the isolation mechanism.

Let me try to motivate the use of a text file a bit more.

Consider the "new tests" scenario where a developer has added a new integration test. If we use annotations, this means someone (the original developer, or another committer) will need to raise a PR after a few days to remove the annotation (assuming the test was stable). Eventually, I was hoping to automate or partially automate this aspect of the system. It seems simpler to write a script that modifies a plain text file compared with removing annotations in Java code.

> we don't need to worry that "quarantined.txt" having out-of-date test names

This could be a problem, yes.

---

Maybe we can consider a two-tiered approach here:

Isolation (manual)
* Test is marked with a tag annotation (see the sketch below)
* This is a permanent state until the developers think the test is healthy again
* These tests are run in a separate build step to not affect build outcomes, but to gather data

Quarantine (automated)
* Tests leaving Isolation enter the Quarantine automatically
* New integration tests enter the Quarantine automatically
* A test stays in quarantine for a few days to evaluate it
* These tests are run in a separate build step to not affect build outcomes, but to gather data
* If all runs are passing, it leaves the quarantine

I think with this approach, we can make the Quarantine fully data-driven and automated. Essentially, the build will query Develocity for flaky test results from the last N days and run those tests separately.

WDYT?
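As a concrete example of the Isolation tier, a minimal sketch using a plain JUnit 5 tag; the tag name "flaky" and the class name are assumptions for illustration, not something the KIP has fixed.

// Minimal sketch of the manual Isolation tier: the test is tagged so that the
// main build can exclude it while a separate build step includes it to gather data.
// The tag name "flaky" is an assumption for illustration.
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

public class SomeIntegrationTest {

    @Test
    @Tag("flaky")   // excluded from the main test run, executed in the data-gathering step
    public void testOccasionallyFailingPath() {
        // ... test body ...
    }
}

Gradle's JUnit Platform support can then include or exclude the tag per build step (for example, excludeTags 'flaky' inside useJUnitPlatform for the main test task and includeTags 'flaky' for the isolated one).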
On Thu, Sep 19, 2024 at 12:49 AM, 吳岱儒 (TaiJuWu) <tjwu1...@gmail.com> wrote:

Hi David,

Thank you for the KIP.

Could we include percentages for each flaky test in quarantined.txt? This would help us prioritize which tests to resolve first.

Additionally, I would prefer to add a flaky (JUnit) tag to the source code so we can focus on these tests during development.

Thanks,
TaiJuWu

On Thu, Sep 19, 2024 at 11:51 AM, TengYao Chi <kiting...@gmail.com> wrote:

Hi David,

Thanks for this great KIP.

I really appreciate the goal of this KIP, which aims to stabilize the build and improve our confidence in CI results.
It addresses a real issue where we've become accustomed to seeing failed results from CI, and this is definitely not good for the Kafka community.

I have a question regarding this KIP:
It seems that we need to maintain the `quarantined.txt` file manually, is that correct?
I'm thinking this could become an issue, especially with the planned removal of ZK in 4.0, which will undoubtedly bring many changes to our codebase.
Given that, maintaining the `quarantined.txt` file might become a pain.
It would be nice if we could maintain it programmatically.

Best Regards,
TengYao
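On the "maintain it programmatically" point, the helper involved could stay quite small. A sketch, assuming the one-test-name-per-line quarantined.txt format discussed in this thread; the file path and the example test name are illustrative only.

// Sketch of a helper that removes a stabilized test from quarantined.txt.
// Assumes one fully-qualified test name per line; the path and example
// argument are made up for illustration.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class Unquarantine {
    public static void main(String[] args) throws IOException {
        Path quarantineFile = Path.of("quarantined.txt");
        String stabilizedTest = args[0];  // e.g. "kafka.server.SomeIntegrationTest"

        // Keep every entry except the test we now consider stable.
        List<String> remaining = Files.readAllLines(quarantineFile).stream()
                .filter(line -> !line.trim().equals(stabilizedTest))
                .collect(Collectors.toList());

        Files.write(quarantineFile, remaining);
    }
}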
On Thu, Sep 19, 2024 at 3:24 AM, Chia-Ping Tsai <chia7...@gmail.com> wrote:

hi David

The KIP is beautiful and I do love a rule which makes us handle those flaky tests seriously.

Regarding the "JUnit Tags", they can bring some benefits to us:

1. We can retry only the tests having the "flaky" annotation. Other non-flaky tests should not be retryable.
2. We don't need to worry about "quarantined.txt" having out-of-date test names.
3. We can require that the flaky annotation must have a JIRA link. That means the PR's author must create the JIRA ticket for the new flaky test.

Also, we can add a Gradle task to generate the "quarantined.txt" file if needed.

Best,
Chia-Ping
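A sketch of what point 3 could look like as a composed annotation; the @Flaky name and the jira() attribute are illustrative rather than an agreed-on API. JUnit 5 supports @Tag as a meta-annotation, so the tag-based filtering described earlier would still apply.

// Sketch of a composed annotation that carries the JUnit "flaky" tag and a
// mandatory JIRA reference. The annotation name and attribute are assumptions
// for illustration only.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import org.junit.jupiter.api.Tag;

@Target({ElementType.METHOD, ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky")                     // JUnit 5 picks this up as a meta-annotation
public @interface Flaky {
    String jira();                // e.g. "KAFKA-12345"; no default, so authors must supply it
}

A test would then be marked with something like @Flaky(jira = "KAFKA-12345"), and a build-time task could collect these annotations to generate the "quarantined.txt" file, as suggested above.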
On Thu, Sep 19, 2024 at 12:02 AM, David Arthur <mum...@gmail.com> wrote:

Hello, Kafka community!

Looking at the last 7 days of GitHub, we have 59 out of 64 trunk builds having flaky tests. Excluding timeouts (a separate issue), only 4 builds out of the last 7 days have failed due to excess test failures. This is actually a slight improvement when compared with the last 28 days. But still, this is obviously a bad situation to be in.

We have previously discussed a few ideas to mitigate the impact that flaky tests have on our builds. For PRs, we are actually seeing a lot of successful status checks due to our use of the Develocity test retry feature. However, the blanket use of "testRetry" is a bad practice in my opinion. It makes it far too easy for us to ignore tests that are only occasionally flaky. It also applies to unit tests, which should never be flaky.

Another problem is that we are naturally introducing flaky tests as new features (and tests) are introduced. Similar to feature development, it takes some time for tests to mature and stabilize -- tests are code, after all.

I have written down a proposal for tracking and managing our flaky tests. I have written this as a KIP even though this is an internal change. I did so because I would like us to discuss, debate, and solidify a plan -- and ultimately vote on it. A KIP seemed like a good fit.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management

I have back-tested this strategy (as best as I can) against our trunk builds from the last month using data from Develocity (i.e., ge.apache.org). I looked at two scenarios. The first scenario was simply quarantining tests with higher than 1% flaky failures; no test re-runs were considered. The second scenario extends the first by allowing up to 3 total flaky failures from non-quarantined tests (tests with less than 1% total flakiness).

Total builds: *238*
Flaky/Failed builds: *228*
Flaky builds, scenario 1 (quarantine only): *40*
Flaky builds, scenario 2 (quarantine + retry): *3*

In other words, we can tackle the worst flaky failures with the quarantine strategy as described in the KIP and handle the long tail of flaky failures with the Develocity retry plugin. If we only had 3 failing trunk builds per month to investigate, I'd say we were in pretty good shape :)

Let me know what you think!

Cheers,
David A

--
David Arthur