Thanks for the proposal, David.

Can modules opt out of this feature? For example, the raft module
doesn't have any integration tests and all of the tests are meant to
be deterministic. It would be dangerous to the protocol's correctness
and the consistency of the cluster metadata to allow contributors to
mark tests as flaky in the raft module instead of investigating the
source of the failure.

On Fri, Sep 20, 2024 at 12:07 PM David Arthur <mum...@gmail.com> wrote:
>
> Chia-Ping, I think something like that can work. I was also thinking about
> extracting the test names during trunk builds using Gradle and storing that
> somewhere. I think it's fair to say we can derive this data from Git and
> Develocity. We can probably figure out the implementation details later on
> :)
>
> I've updated the KIP to describe the two-tiered approach as well as
> including a diagram.
>
> -David
>
>
> On Thu, Sep 19, 2024 at 12:40 PM Chia-Ping Tsai <chia7...@gmail.com> wrote:
>
> > > However, this doesn't help with the newly added tests that introduce
> > > flakiness. For this, we need some way to detect when a new test has been
> > > added. Still thinking through this...
> >
> > Maybe we can use git + gradle develocity to address it.
> >
> > 1) list the files changed recently (for example: git diff --name-only "@{7
> > days ago}")
> > 2) use the test-related filename to query Gradle Develocity with "7 days
> > ago"
> > 3) the test cases having fewer builds are viewed as "new tests"
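> >
> > A rough sketch of those three steps (just to show the shape of the idea;
> > the Develocity endpoint, query parameters, and DEVELOCITY_TOKEN env var
> > below are placeholders, not the real API):
> >
> > import java.io.BufferedReader;
> > import java.io.InputStreamReader;
> > import java.net.URI;
> > import java.net.http.HttpClient;
> > import java.net.http.HttpRequest;
> > import java.net.http.HttpResponse;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > public class NewTestDetector {
> >     public static void main(String[] args) throws Exception {
> >         // 1) list the test files changed in the last 7 days
> >         Process git = new ProcessBuilder(
> >             "git", "diff", "--name-only", "@{7 days ago}").start();
> >         List<String> changedTestFiles = new ArrayList<>();
> >         try (BufferedReader reader = new BufferedReader(
> >                 new InputStreamReader(git.getInputStream()))) {
> >             String line;
> >             while ((line = reader.readLine()) != null) {
> >                 if (line.endsWith("Test.java") || line.endsWith("Test.scala")) {
> >                     changedTestFiles.add(line);
> >                 }
> >             }
> >         }
> >
> >         // 2) query Develocity for recent builds of each changed test class.
> >         //    The URL below is a placeholder; consult the Develocity API docs
> >         //    for the real endpoint and parameters.
> >         HttpClient http = HttpClient.newHttpClient();
> >         for (String file : changedTestFiles) {
> >             String testClass = file.substring(file.lastIndexOf('/') + 1)
> >                 .replaceAll("\\.(java|scala)$", "");
> >             HttpRequest request = HttpRequest.newBuilder(
> >                     URI.create("https://ge.apache.org/api/tests?name=" + testClass + "&since=7d"))
> >                 .header("Authorization", "Bearer " + System.getenv("DEVELOCITY_TOKEN"))
> >                 .build();
> >             HttpResponse<String> response =
> >                 http.send(request, HttpResponse.BodyHandlers.ofString());
> >             // 3) a test class with very few recorded builds is treated as "new"
> >             //    (parsing of the response body is omitted in this sketch)
> >             System.out.println(testClass + " -> " + response.statusCode());
> >         }
> >     }
> > }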
> >
> > WDYT?
> >
> > On Thu, Sep 19, 2024 at 11:56 PM David Arthur <mum...@gmail.com> wrote:
> >
> > > TengYao,
> > >
> > > > These two mechanisms are independent. We could manually remove a tag
> > > > from a test, but at the same time, it might still be quarantined.
> > > > I know the above situation might sound weird, but I just want to
> > > > understand how it would work.
> > >
> > > If we remove a tag from a test, we are signaling that we *think* it is
> > > now stable. However, until we have data to prove that, we can run the
> > > test in the quarantine. This helps with the situation where there are
> > > multiple sources of flakiness and we are making iterative progress on
> > > improving it.
> > >
> > > The alternative is to keep the fix in a PR until the developer gets
> > > enough signal from the PR builds to be convinced that the flakiness is
> > > gone. I mention in the KIP that this approach has issues.
> > >
> > >
> > > Chia-Ping,
> > >
> > > > If so, could we query the Gradle Develocity to get the flaky tests and
> > > > then make only those flaky tests retryable in CI?
> > >
> > > This can work for existing flaky tests. We can query Develocity to find
> > > flaky tests in the last N days and run only those tests with retry. This
> > > is what I meant by "data driven" for the quarantine. So, no need to track
> > > "quarantined.txt" file separately -- we can derive this data at build
> > > time from Develocity.
> > >
> > > However, this doesn't help with the newly added tests that introduce
> > > flakiness. For this, we need some way
> > > to detect when a new test has been added. Still thinking through this...
> > >
> > > -David
> > >
> > > On Thu, Sep 19, 2024 at 11:33 AM Chia-Ping Tsai <chia7...@gmail.com>
> > > wrote:
> > >
> > > > hi David
> > > >
> > > > The two-tiered approach is interesting and I have questions similar to
> > > > TengYao.
> > > >
> > > > BUT, going back to the usage of quarantine and isolation: it seems to
> > > > me they are used to keep our CI from being noised by the flaky tests,
> > > > right? If so, could we query the Gradle Develocity to get the flaky
> > > > tests and then make only those flaky tests retryable in CI? That means
> > > > a non-flaky (stable) test must pass without retry.
> > > >
> > > > This approach is simpler (I guess), and it can isolate the flaky tests
> > > > (by retry) from CI. If a flaky test can't pass even after retry, I feel
> > > > it is a critical failed test rather than just a flaky one. On top of
> > > > that, this approach does not need an extra file ("quarantined.txt")
> > > > under git control, and it does not need any changes to the test class
> > > > code.
> > > >
> > > > Best,
> > > > Chia-Ping
> > > >
> > > > On Thu, Sep 19, 2024 at 11:19 PM TengYao Chi <kiting...@gmail.com> wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for the explanation.
> > > > >
> > > > > I like this two-tiered approach, which gives us more flexibility to
> > > > > handle flaky tests.
> > > > >
> > > > > The following is my understanding of how it works; please correct me
> > > > > if I'm wrong:
> > > > > If we adopt the two-tiered approach, the test might have two states
> > > > > (Isolated by developer, Quarantined automatically).
> > > > > These two mechanisms are independent. We could manually remove a tag
> > > > > from a test, but at the same time, it might still be quarantined.
> > > > > I know the above situation might sound weird, but I just want to
> > > > > understand how it would work.
> > > > >
> > > > > Best Regards,
> > > > > TengYao
> > > > >
> > > > > On Thu, Sep 19, 2024 at 10:07 PM David Arthur <mum...@gmail.com> wrote:
> > > > >
> > > > > > Chia/TengYao/TaiJuWu, I agree that tags are a straightforward
> > > > > > approach. In fact, my initial idea was to use tags as the isolation
> > > > > > mechanism.
> > > > > > Let me try to motivate the use of a text file a bit more.
> > > > > >
> > > > > > Consider the "new tests" scenario where a developer has added a new
> > > > > > integration test. If we use annotations, this means someone (the
> > > > > > original developer, or another committer) will need to raise a PR
> > > > > > after a few days to remove the annotation (assuming the test was
> > > > > > stable). Eventually, I was hoping to automate or partially automate
> > > > > > this aspect of the system. It seems simpler to write a script that
> > > > > > modifies a plain text file compared with removing annotations in
> > > > > > Java code.
> > > > > >
> > > > > > > we don't need to worry about "quarantined.txt" having out-of-date
> > > > > > > test names
> > > > > >
> > > > > > This could be a problem, yes.
> > > > > >
> > > > > > ---
> > > > > >
> > > > > > Maybe we can consider a two-tiered approach here:
> > > > > >
> > > > > > Isolation (manual)
> > > > > > * Test is marked with tag annotation
> > > > > > * This is a permanent state until the developers think the test is
> > > > > > healthy again
> > > > > > * These tests are run in a separate build step to not affect build
> > > > > > outcomes, but gather data
> > > > > >
> > > > > >
> > > > > > Quarantine (automated)
> > > > > > * Tests leaving Isolation enter the Quarantine automatically
> > > > > > * New integration tests enter the Quarantine automatically
> > > > > > * Test stays in quarantine for a few days to evaluate
> > > > > > * These tests are run in a separate build step to not affect build
> > > > > > outcomes, but gather data
> > > > > > * If all runs are passing, it leaves the quarantine
> > > > > >
> > > > > > I think with this approach, we can make the Quarantine fully
> > > > > > data-driven and automated. Essentially, the build will query
> > > > > > Develocity for flaky test results from the last N days and run those
> > > > > > tests separately.
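> > > > > >
> > > > > > As a small sketch of the manual tier (the "isolated" tag name and the
> > > > > > test class below are made up for illustration), the marker could simply
> > > > > > be a JUnit 5 tag that the separate build step selects with an
> > > > > > include-tags filter while the main run excludes it:
> > > > > >
> > > > > > import org.junit.jupiter.api.Tag;
> > > > > > import org.junit.jupiter.api.Test;
> > > > > >
> > > > > > // Hypothetical example: the main test run excludes the "isolated" tag,
> > > > > > // and a dedicated build step includes it so results are still gathered
> > > > > > // without affecting the overall build outcome.
> > > > > > @Tag("isolated")
> > > > > > class SomeFlakyIntegrationTest {
> > > > > >
> > > > > >     @Test
> > > > > >     void testUnderInvestigation() {
> > > > > >         // the flaky behavior being investigated lives here
> > > > > >     }
> > > > > > }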
> > > > > >
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > On Thu, Sep 19, 2024 at 12:49 AM 吳岱儒 <tjwu1...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Thank you for the KIP.
> > > > > > >
> > > > > > > Could we include percentages for each flaky test in
> > > > > > > quarantined.txt? This would help us prioritize which tests to
> > > > > > > resolve first.
> > > > > > >
> > > > > > > Additionally, I would prefer to add a flaky (JUnit) tag to the
> > > > > > > source code so we can focus on these tests during development.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > TaiJuWu
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Sep 19, 2024 at 11:51 AM TengYao Chi <kiting...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for this great KIP.
> > > > > > > >
> > > > > > > > I really appreciate the goal of this KIP, which aims to
> > > > > > > > stabilize the build and improve our confidence in CI results.
> > > > > > > > It addresses a real issue where we've become accustomed to
> > > > > > > > seeing failed results from CI, and this is definitely not good
> > > > > > > > for the Kafka community.
> > > > > > > >
> > > > > > > > I have a question regarding this KIP:
> > > > > > > > It seems that we need to maintain the `quarantined.txt` file
> > > > > > > > manually, is that correct?
> > > > > > > > I'm thinking this could become an issue, especially with the
> > > > > > > > planned removal of ZK in 4.0, which will undoubtedly bring many
> > > > > > > > changes to our codebase.
> > > > > > > > Given that, maintaining the `quarantined.txt` file might become
> > > > > > > > a pain.
> > > > > > > > It would be nice if we could maintain it programmatically.
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > TengYao
> > > > > > > >
> > > > > > > > On Thu, Sep 19, 2024 at 3:24 AM Chia-Ping Tsai <chia7...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > hi David
> > > > > > > > >
> > > > > > > > > The KIP is beautiful and I do love a rule which makes us
> > > > > > > > > handle those flaky tests seriously.
> > > > > > > > >
> > > > > > > > > Regarding the "JUnit Tags", it can bring some benefits to us.
> > > > > > > > >
> > > > > > > > > 1. we can retry only the tests having the "flaky" annotation.
> > > > > > > > > Other non-flaky tests should not be retryable
> > > > > > > > > 2. we don't need to worry about "quarantined.txt" having
> > > > > > > > > out-of-date test names
> > > > > > > > > 3. we can require that the flaky annotation must have a jira
> > > > > > > > > link. That means the PR's author must create the jira link for
> > > > > > > > > the new flaky test
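> > > > > > > > >
> > > > > > > > > For example (just a sketch; the annotation name and attribute are
> > > > > > > > > made up), a composed JUnit 5 annotation could carry both the tag
> > > > > > > > > and the required jira link:
> > > > > > > > >
> > > > > > > > > import java.lang.annotation.ElementType;
> > > > > > > > > import java.lang.annotation.Retention;
> > > > > > > > > import java.lang.annotation.RetentionPolicy;
> > > > > > > > > import java.lang.annotation.Target;
> > > > > > > > > import org.junit.jupiter.api.Tag;
> > > > > > > > >
> > > > > > > > > // Sketch only: forces the author to attach a jira link when marking
> > > > > > > > > // a test as flaky, e.g. @Flaky(jira = "KAFKA-12345") on a test
> > > > > > > > > // class or method.
> > > > > > > > > @Target({ElementType.TYPE, ElementType.METHOD})
> > > > > > > > > @Retention(RetentionPolicy.RUNTIME)
> > > > > > > > > @Tag("flaky")
> > > > > > > > > public @interface Flaky {
> > > > > > > > >     String jira();
> > > > > > > > > }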
> > > > > > > > >
> > > > > > > > > Also, we can add a gradle task to generate the
> > > > > > > > > "quarantined.txt" file if needed.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Chia-Ping
> > > > > > > > >
> > > > > > > > > On Thu, Sep 19, 2024 at 12:02 AM David Arthur <mum...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello, Kafka community!
> > > > > > > > > >
> > > > > > > > > > Looking at the last 7 days of GitHub, we have 59 out of 64
> > > > > > > > > > trunk builds having flaky tests. Excluding timeouts (a
> > > > > > > > > > separate issue), only 4 builds out of the last 7 days have
> > > > > > > > > > failed due to excess test failures. This is actually a slight
> > > > > > > > > > improvement when compared with the last 28 days. But still,
> > > > > > > > > > this is obviously a bad situation to be in.
> > > > > > > > > >
> > > > > > > > > > We have previously discussed a few ideas to mitigate the
> > > > > > > > > > impact that flaky tests have on our builds. For PRs, we are
> > > > > > > > > > actually seeing a lot of successful status checks due to our
> > > > > > > > > > use of the Develocity test retry feature. However, the
> > > > > > > > > > blanket use of "testRetry" is a bad practice in my opinion.
> > > > > > > > > > It makes it far too easy for us to ignore tests that are only
> > > > > > > > > > occasionally flaky. It also applies to unit tests which
> > > > > > > > > > should never be flaky.
> > > > > > > > > >
> > > > > > > > > > Another problem is that we are naturally introducing flaky
> > > > > > > > > > tests as new features (and tests) are introduced. Similar to
> > > > > > > > > > feature development, it takes some time for tests to mature
> > > > > > > > > > and stabilize -- tests are code, after all.
> > > > > > > > > >
> > > > > > > > > > I have written down a proposal for tracking and managing our
> > > > > > > > > > flaky tests. I have written this as a KIP even though this is
> > > > > > > > > > an internal change. I did so because I would like us to
> > > > > > > > > > discuss, debate, and solidify a plan -- and ultimately vote
> > > > > > > > > > on it. A KIP seemed like a good fit.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1090+Flaky+Test+Management
> > > > > > > > > >
> > > > > > > > > > I have back-tested this strategy (as best as I can) against
> > > > > > > > > > our trunk builds from the last month using data from
> > > > > > > > > > Develocity (i.e., ge.apache.org). I looked at two scenarios.
> > > > > > > > > > The first scenario was simply quarantining tests with higher
> > > > > > > > > > than 1% flaky failures; no test re-runs were considered. The
> > > > > > > > > > second scenario extends the first by allowing up to 3 total
> > > > > > > > > > flaky failures from non-quarantined tests (tests with less
> > > > > > > > > > than 1% total flakiness).
> > > > > > > > > >
> > > > > > > > > > Total builds: *238*
> > > > > > > > > > Flaky/Failed builds: *228*
> > > > > > > > > > Flaky builds scenario 1 (quarantine only): *40*
> > > > > > > > > > Flaky builds scenario 2 (quarantine + retry): *3*
> > > > > > > > > >
> > > > > > > > > > In other words, we can tackle the worst flaky failures with
> > > > > > > > > > the quarantine strategy as described in the KIP and handle
> > > > > > > > > > the long tail of flaky failures with the Develocity retry
> > > > > > > > > > plugin. If we only had 3 failing trunk builds per month to
> > > > > > > > > > investigate, I'd say we were in pretty good shape :)
> > > > > > > > > >
> > > > > > > > > > Let me know what you think!
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > David A
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > David Arthur
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > David Arthur
> > >
> >
>
>
> --
> David Arthur



-- 
-José
