hi José and David

Maybe unit tests should be excluded from the "Flaky Test Management" and "retry" mechanisms.
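For option 2 below, the retry filter could look roughly like this. This is only a sketch in Gradle Kotlin DSL (the real build is Groovy), it assumes the org.gradle.test-retry plugin's filter API, and the "allowRetryUnitTest" property and the @IntegrationTest marker annotation are placeholder names:

plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

// CI would only ever pass -PallowRetryUnitTest=true if we decide to lift the rule.
val allowRetryUnitTest = providers.gradleProperty("allowRetryUnitTest")
    .map { it.toBoolean() }
    .getOrElse(false)

tasks.test {
    retry {
        maxRetries.set(3)
        filter {
            // includeAnnotationClasses matches class-level annotations by name,
            // so this presumes integration tests carry a marker annotation such
            // as a hypothetical @IntegrationTest. Plain unit tests stay
            // non-retryable and must pass deterministically on the first run.
            if (!allowRetryUnitTest) {
                includeAnnotationClasses.add("*IntegrationTest*")
            }
        }
    }
}

So, concretely: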
1. unit test can NOT have "@Flaky" (if we decided to use annotation in the flaky management) 2. CI will enable the retry only for integration test. This can be addressed by adding a new flag "allowRetryUnitTest" and add filter to "retry plugin" the option 2 can be added to our CI easily. WDYT? Best, Chia-Ping David Arthur <mum...@gmail.com> 於 2024年9月21日 週六 上午10:27寫道: > José, > > By default, unit tests will not be eligible for the automatic new test > quarantine. Like you said, if a unit test fails, it indicates a problem. > > Perhaps we could auto-fail a test marked as "flaky" that is not also tagged > as "integration"? > > > > Chia-Ping, > > >IMHO, the rule should be "unit test should always be deterministic > (non-retryable)" > > I agree :) > > On Fri, Sep 20, 2024 at 9:17 PM Chia-Ping Tsai <chia7...@gmail.com> wrote: > > > > Can modules opt out of this feature? For example, the raft module > > doesn't have any integration tests and all of the tests are meant to > > be deterministic. It would be dangerous to the protocol's correctness > > and the consistency of the cluster metadata to allow contributors to > > mark tests as flaky in the raft module instead of investigating the > > source of the failure. > > > > IMHO, the rule should be "unit test should always be deterministic > > (non-retryable)" > > > > Best, > > Chia-Ping > > > > José Armando García Sancio <jsan...@confluent.io.invalid> 於 2024年9月21日 > 週六 > > 上午9:01寫道: > > > > > Thanks for the proposal David. > > > > > > Can modules opt out of this feature? For example, the raft module > > > doesn't have any integration tests and all of the tests are meant to > > > be deterministic. It would be dangerous to the protocol's correctness > > > and the consistency of the cluster metadata to allow contributors to > > > mark tests as flaky in the raft module instead of investigating the > > > source of the failure. > > > > > > On Fri, Sep 20, 2024 at 12:07 PM David Arthur <mum...@gmail.com> > wrote: > > > > > > > > Chia-Ping, I think something like that can work. I was also thinking > > > about > > > > extracting the test names during trunk builds using Gradle and > storing > > > that > > > > somewhere. I think it's fair to say we can derive this data from Git > > and > > > > Develocity. We can probably figure out the implementation details > later > > > on > > > > :) > > > > > > > > I've updated the KIP to describe the two-tiered approach as well as > > > > including a diagram. > > > > > > > > -David > > > > > > > > > > > > On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai <chia7...@gmail.com> > > > wrote: > > > > > > > > > > However, this doesn't help with the newly added tests that > > introduce > > > > > flakiness. For this, we need some way > > > > > to detect when a new test has been added. Still thinking through > > > this... > > > > > > > > > > Maybe we can use git + gradle develocity to address it. > > > > > > > > > > 1) list the files changed recently (for example: git diff > --name-only > > > "@{7 > > > > > days ago}") > > > > > 2) use the test-related filename to query Gradle Develocity with "7 > > > days > > > > > ago" > > > > > 3) the test cases having fewer builds are viewed as "new tests" > > > > > > > > > > WDYT? > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 下午11:56寫道: > > > > > > > > > > > TengYao, > > > > > > > > > > > > > These two mechanisms are independent. We could manually remove > a > > > tag > > > > > from > > > > > > a > > > > > > test, but at the same time, it might still be quarantined. 
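On the git + Develocity idea for detecting new tests (quoted above), here is a rough Kotlin sketch of the git half. The Develocity lookup is left abstract because the thread does not pin down an API, and the 7-day window, filename filter, and build-count cutoff are only guesses:

import java.util.concurrent.TimeUnit

// Step 1: test-related files touched recently, per `git diff --name-only "@{7 days ago}"`.
fun recentlyChangedTestFiles(days: Int = 7): List<String> {
    val proc = ProcessBuilder("git", "diff", "--name-only", "@{$days days ago}")
        .redirectErrorStream(true)
        .start()
    val output = proc.inputStream.bufferedReader().readLines()
    proc.waitFor(60, TimeUnit.SECONDS)
    return output.filter { it.endsWith("Test.java") || it.endsWith("Test.scala") }
}

// Step 2 (placeholder): how many builds in the same window ran this test,
// according to Develocity. Deliberately left unimplemented.
fun buildCountInDevelocity(testClass: String, days: Int = 7): Int = TODO()

// Step 3: a test seen in only a handful of builds is treated as "new".
fun looksNew(testClass: String, minBuilds: Int = 5): Boolean =
    buildCountInDevelocity(testClass) < minBuilds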
> > > > > > I know the above situation might sound weird, but I just want to > > > > > understand > > > > > > how it would work. > > > > > > > > > > > > If we remove a tag from a test, we are signaling that we *think* > it > > > is > > > > > now > > > > > > stable. However, until we have data > > > > > > to prove that, we can run the test in the quarantine. This helps > > > with the > > > > > > situation where there are multiple > > > > > > sources of flakiness and we are making iterative progress on > > > improving > > > > > it. > > > > > > > > > > > > The alternative is to keep the fix in a PR until the developer > gets > > > > > enough > > > > > > signal from the PR builds to be > > > > > > convinced that the flakiness is gone. I mention in the KIP that > > this > > > > > > approach has issues. > > > > > > > > > > > > > > > > > > Chia-Ping, > > > > > > > > > > > > > If so, could we query the Gradle develocity to get flaky and > then > > > make > > > > > > only those flaky > > > > > > retryable in CI? > > > > > > > > > > > > This can work for existing flaky tests. We can query Develocity > to > > > find > > > > > > flaky tests in the last N days > > > > > > and run only those tests with retry. This is what I meant by > "data > > > > > driven" > > > > > > for the quarantine. So, no need to track > > > > > > "quarantined.txt" file separately -- we can derive this data at > > build > > > > > time > > > > > > from Develocity. > > > > > > > > > > > > However, this doesn't help with the newly added tests that > > introduce > > > > > > flakiness. For this, we need some way > > > > > > to detect when a new test has been added. Still thinking through > > > this... > > > > > > > > > > > > -David > > > > > > > > > > > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai < > > chia7...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > hi David > > > > > > > > > > > > > > The two-tiered approach is interesting and I have questions > > > similar to > > > > > > > TengYao. > > > > > > > > > > > > > > BUT, go back to the usage of quarantine and isolation. It seems > > to > > > me > > > > > > they > > > > > > > are used to make our CI not be noised by the flaky, right? If > so, > > > could > > > > > > we > > > > > > > query the Gradle develocity to get flaky and then make only > those > > > flaky > > > > > > > retryable in CI? That means a non-flaky (stable) test must pass > > > without > > > > > > > retry. > > > > > > > > > > > > > > This approach is more simple (I guess), and it can isolate the > > > flaky > > > > > (by > > > > > > > retry) from CI. If a flaky can't pass after retry, I feel it > is a > > > > > > critical > > > > > > > failed test rather than just a flaky. On top of that, this > > approach > > > > > does > > > > > > > not need an extra file ("quarantined.txt") in git control and > it > > > does > > > > > not > > > > > > > need to change any code of test classes. > > > > > > > > > > > > > > Best, > > > > > > > Chia-Ping > > > > > > > > > > > > > > TengYao Chi <kiting...@gmail.com> 於 2024年9月19日 週四 下午11:19寫道: > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > Thanks for the explanation. > > > > > > > > > > > > > > > > I like this two-tiered approach, which gives us more > > flexibility > > > to > > > > > > > handle > > > > > > > > flaky tests. 
> > > > > > > > > > > > > > > > The following is my understanding of how it works; please > > > correct me > > > > > if > > > > > > > I'm > > > > > > > > wrong: > > > > > > > > If we adopt the two-tiered approach, the test might have two > > > > > > > > states.(Isolated by developer, Quarantined automatically). > > > > > > > > These two mechanisms are independent. We could manually > remove > > a > > > tag > > > > > > > from a > > > > > > > > test, but at the same time, it might still be quarantined. > > > > > > > > I know the above situation might sound weird, but I just want > > to > > > > > > > understand > > > > > > > > how it would work. > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > TengYao > > > > > > > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 下午10:07寫道: > > > > > > > > > > > > > > > > > Chia/TengYao/TaiJuWu, I agree that tags are a > straightforward > > > > > > approach. > > > > > > > > In > > > > > > > > > fact, my initial idea was to use tags as the isolation > > > mechanism. > > > > > > > > > > > > > > > > > > Let me try to motivate the use of a text file a bit more. > > > > > > > > > > > > > > > > > > Consider the "new tests" scenario where a developer has > added > > > a new > > > > > > > > > integration test. If we use annotations, this means someone > > > (the > > > > > > > original > > > > > > > > > developer, or another committer) will need to raise a PR > > after > > > a > > > > > few > > > > > > > days > > > > > > > > > to remove the annotation (assuming the test was stable). > > > > > Eventually, > > > > > > I > > > > > > > > was > > > > > > > > > hoping to automate or partially automate this aspect of the > > > system. > > > > > > It > > > > > > > > > seems simpler to write a script that modifies a plain text > > file > > > > > > > compared > > > > > > > > > with removing annotations in Java code. > > > > > > > > > > > > > > > > > > > we don't need to worry that "quarantined.txt" having > > > out-of-date > > > > > > test > > > > > > > > > names > > > > > > > > > > > > > > > > > > This could be a problem, yes. > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > Maybe we can consider a two-tiered approach here: > > > > > > > > > > > > > > > > > > Isolation (manual) > > > > > > > > > * Test is marked with tag annotation > > > > > > > > > * This is a permanent state until the developers think the > > > test is > > > > > > > > healthy > > > > > > > > > again > > > > > > > > > * These tests are run in a separate build step to not > affect > > > build > > > > > > > > > outcomes, but gather data > > > > > > > > > > > > > > > > > > > > > > > > > > > Quarantine (automated) > > > > > > > > > * Tests leaving Isolation enter the Quarantine > automatically > > > > > > > > > * New integration tests enter the Quarantine automatically > > > > > > > > > * Test stays in quarantine for a few days to evaluate > > > > > > > > > * These tests are run in a separate build step to not > affect > > > build > > > > > > > > > outcomes, but gather data > > > > > > > > > * If all runs are passing, it leaves the quarantine > > > > > > > > > > > > > > > > > > I think with this approach, we can make the Quarantine > fully > > > > > > > data-driven > > > > > > > > > and automated. Essentially, the build will query Develocity > > for > > > > > flaky > > > > > > > > test > > > > > > > > > results from the last N days and run those tests > separately. > > > > > > > > > > > > > > > > > > > > > > > > > > > WDYT? 
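For the "separate build step" in both tiers, a second Test task keyed off a JUnit tag might be enough. A Gradle Kotlin DSL sketch (the task name and the "flaky" tag are placeholders; the real wiring would live in the existing Groovy build):

tasks.register<Test>("quarantinedTest") {
    // A manually registered Test task does not inherit the test source set,
    // so wire the classes and classpath up explicitly.
    testClassesDirs = sourceSets["test"].output.classesDirs
    classpath = sourceSets["test"].runtimeClasspath
    useJUnitPlatform {
        includeTags("flaky")
    }
    // Gather data without affecting the build outcome.
    ignoreFailures = true
}

tasks.test {
    useJUnitPlatform {
        excludeTags("flaky")
    }
}

CI would then run quarantinedTest as its own step, so its results are reported but never gate the build.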
> > > > > > > > > > > > > > > > > > On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> > > > wrote: > > > > > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > > > > > Thank you for KIP. > > > > > > > > > > > > > > > > > > > > Could we include percentages for each flaky test in > > > > > > quarantined.txt? > > > > > > > > This > > > > > > > > > > would help us prioritize which tests to resolve first. > > > > > > > > > > > > > > > > > > > > Additionally, I would prefer to add a flaky (JUnit) tag > to > > > the > > > > > > source > > > > > > > > > code > > > > > > > > > > so we can focus on these tests during development. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > TaiJuWu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi < > > > > > kiting...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > > > > > > > > > > > > > Thanks for this great KIP. > > > > > > > > > > > > > > > > > > > > > > I really appreciate the goal of this KIP, which aims to > > > > > stabilize > > > > > > > the > > > > > > > > > > build > > > > > > > > > > > and improve our confidence in CI results. > > > > > > > > > > > It addresses a real issue where we've become accustomed > > to > > > > > seeing > > > > > > > > > failed > > > > > > > > > > > results from CI, and this is definitely not good for > the > > > Kafka > > > > > > > > > community. > > > > > > > > > > > > > > > > > > > > > > I have a question regarding this KIP: > > > > > > > > > > > It seems that we need to maintain the `quarantined.txt` > > > files > > > > > > > > manually, > > > > > > > > > > is > > > > > > > > > > > that correct? > > > > > > > > > > > I'm thinking this could become an issue, especially > with > > > the > > > > > > > planned > > > > > > > > > > > removal of ZK in 4.0, which will undoubtedly bring many > > > changes > > > > > > to > > > > > > > > our > > > > > > > > > > > codebase. > > > > > > > > > > > Given that, maintaining the `quarantined.txt` files > might > > > > > become > > > > > > a > > > > > > > > > pain. > > > > > > > > > > > It would be nice if we could maintain it > > programmatically. > > > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > > > TengYao > > > > > > > > > > > > > > > > > > > > > > Chia-Ping Tsai <chia7...@gmail.com> 於 2024年9月19日 週四 > > > 上午3:24寫道: > > > > > > > > > > > > > > > > > > > > > > > hi David > > > > > > > > > > > > > > > > > > > > > > > > The KIP is beautiful and I do love a rule which makes > > us > > > > > handle > > > > > > > > those > > > > > > > > > > > flaky > > > > > > > > > > > > seriously. > > > > > > > > > > > > > > > > > > > > > > > > Regarding the "JUnit Tags", it can bring some > benefits > > > to us. > > > > > > > > > > > > > > > > > > > > > > > > 1. we can retry only the tests having "flaky" > > annotation. > > > > > Other > > > > > > > > > > non-flaky > > > > > > > > > > > > tests should not be retryable > > > > > > > > > > > > 2. we don't need to worry that "quarantined.txt" > having > > > > > > > out-of-date > > > > > > > > > > test > > > > > > > > > > > > names > > > > > > > > > > > > 3. we can require the flaky annotation must have jira > > > link. 
> > > > > > That > > > > > > > > > means > > > > > > > > > > > the > > > > > > > > > > > > PR's author must create the jira link for the new > flaky > > > > > > > > > > > > > > > > > > > > > > > > Also, we can add a gradle task to generate > > > "quarantined.txt" > > > > > > file > > > > > > > > if > > > > > > > > > > > needs. > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Chia-Ping > > > > > > > > > > > > > > > > > > > > > > > > David Arthur <mum...@gmail.com> 於 2024年9月19日 週四 > > > 上午12:02寫道: > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Kafka community! > > > > > > > > > > > > > > > > > > > > > > > > > > Looking at the last 7 days of GitHub, we have 59 > out > > > of 64 > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > > having flaky tests. Excluding timeouts (a separate > > > issue), > > > > > > > only 4 > > > > > > > > > > > builds > > > > > > > > > > > > > out of the last 7 days have failed due to excess > test > > > > > > failures. > > > > > > > > > This > > > > > > > > > > is > > > > > > > > > > > > > actually a slight improvement when compared with > the > > > last > > > > > 28 > > > > > > > > days. > > > > > > > > > > But > > > > > > > > > > > > > still, this is obviously a bad situation to be in. > > > > > > > > > > > > > > > > > > > > > > > > > > We have previously discussed a few ideas to > mitigate > > > the > > > > > > impact > > > > > > > > > that > > > > > > > > > > > > flaky > > > > > > > > > > > > > tests have on our builds. For PRs, we are actually > > > seeing a > > > > > > lot > > > > > > > > of > > > > > > > > > > > > > successful status checks due to our use of the > > > Develocity > > > > > > test > > > > > > > > > retry > > > > > > > > > > > > > feature. However, the blanket use of "testRetry" > is a > > > bad > > > > > > > > practice > > > > > > > > > in > > > > > > > > > > > > > my opinion. It makes it far too easy for us to > ignore > > > tests > > > > > > > that > > > > > > > > > are > > > > > > > > > > > only > > > > > > > > > > > > > occasionally flaky. It also applies to unit tests > > which > > > > > > should > > > > > > > > > never > > > > > > > > > > be > > > > > > > > > > > > > flaky. > > > > > > > > > > > > > > > > > > > > > > > > > > Another problem is that we are naturally > introducing > > > flaky > > > > > > > tests > > > > > > > > as > > > > > > > > > > new > > > > > > > > > > > > > features (and tests) are introduced. Similar to > > feature > > > > > > > > > development, > > > > > > > > > > it > > > > > > > > > > > > > takes some time for tests to mature and stabilize > -- > > > tests > > > > > > are > > > > > > > > > code, > > > > > > > > > > > > after > > > > > > > > > > > > > all. > > > > > > > > > > > > > > > > > > > > > > > > > > I have written down a proposal for tracking and > > > managing > > > > > our > > > > > > > > flaky > > > > > > > > > > > > tests. I > > > > > > > > > > > > > have written this as a KIP even though this is an > > > internal > > > > > > > > change. > > > > > > > > > I > > > > > > > > > > > did > > > > > > > > > > > > so > > > > > > > > > > > > > because I would like us to discuss, debate, and > > > solidify a > > > > > > plan > > > > > > > > -- > > > > > > > > > > and > > > > > > > > > > > > > ultimately vote on it. A KIP seemed like a good > fit. 
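On the earlier point that a flaky annotation must carry a Jira link, the annotation itself can enforce that by making the element mandatory. A small Kotlin sketch (the real annotation would presumably be a Java one in the test utilities, and the names are only illustrative):

import org.junit.jupiter.api.Tag

// JUnit 5 @Tag works as a meta-annotation, so @Flaky both tags the test as
// "flaky" and forces the author to record a tracking ticket: `jira` has no
// default value, so @Flaky without it does not compile.
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
@Retention(AnnotationRetention.RUNTIME)
@Tag("flaky")
annotation class Flaky(val jira: String)

// Usage: @Flaky(jira = "KAFKA-XXXXX") on a test class or method.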
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management > > > > > > > > > > > > > > > > > > > > > > > > > > I have back-tested this strategy (as best as I can) > > to > > > our > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > > from the last month using data from Develocity > (i.e., > > > > > > > > > ge.apache.org > > > > > > > > > > ). > > > > > > > > > > > I > > > > > > > > > > > > > looked at two scenarios. The first scenario was > > simply > > > > > > > > quarantining > > > > > > > > > > > tests > > > > > > > > > > > > > with higher than 1% flaky failures, no test re-runs > > > were > > > > > > > > > considered. > > > > > > > > > > > The > > > > > > > > > > > > > second scenario extends the first by allowing up > to 3 > > > total > > > > > > > flaky > > > > > > > > > > > > failures > > > > > > > > > > > > > from non-quarantined tests (tests with less than 1% > > > total > > > > > > > > > flakiness). > > > > > > > > > > > > > > > > > > > > > > > > > > Total builds: *238* > > > > > > > > > > > > > Flaky/Failed builds: *228* > > > > > > > > > > > > > Flaky builds scenario 1 (quarantine only): *40* > > > > > > > > > > > > > Flaky builds scenario 2 (quarantine + retry): *3* > > > > > > > > > > > > > > > > > > > > > > > > > > In other words, we can tackle the worst flaky > > failures > > > with > > > > > > the > > > > > > > > > > > > quarantine > > > > > > > > > > > > > strategy as described in the KIP and handle the > long > > > tail > > > > > of > > > > > > > > flaky > > > > > > > > > > > > failures > > > > > > > > > > > > > with the Develocity retry plugin. If we only had 3 > > > failing > > > > > > > trunk > > > > > > > > > > builds > > > > > > > > > > > > per > > > > > > > > > > > > > month to investigate, I'd say we were in pretty > good > > > shape > > > > > :) > > > > > > > > > > > > > > > > > > > > > > > > > > Let me know what you think! > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > David A > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > David Arthur > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > David Arthur > > > > > > > > > > > > > > > > > > > > > > > -- > > > > David Arthur > > > > > > > > > > > > -- > > > -José > > > > > > > > -- > David Arthur >
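To make the two back-tested scenarios concrete, a toy Kotlin sketch of the rules as described above: scenario 1 quarantines anything above a 1% flaky rate, and scenario 2 additionally tolerates up to 3 flaky failures per build from non-quarantined tests. The data types and the way the window is fed in are only illustrative:

// Toy model; Develocity is the assumed data source and TestStats is made up.
data class TestStats(val name: String, val runs: Int, val flakyFailures: Int) {
    val flakyRate: Double
        get() = if (runs == 0) 0.0 else flakyFailures.toDouble() / runs
}

// Scenario 1: quarantine every test whose flaky rate over the window exceeds 1%.
fun quarantined(stats: List<TestStats>, threshold: Double = 0.01): Set<String> =
    stats.filter { it.flakyRate > threshold }.map { it.name }.toSet()

// Scenario 2: additionally allow up to 3 flaky failures per build from
// non-quarantined tests before treating the build as failed.
fun buildIsGreen(flakyFailuresOutsideQuarantine: Int, budget: Int = 3): Boolean =
    flakyFailuresOutsideQuarantine <= budget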