To checkpoint this conversation and keep it going, the ideas I see
in-thread (light editorializing by me):
1. Blocking PR merge on CI being green (viable for single-branch commits,
less so for multi-branch ones)
2. A change in our expected culture of "if you see something, fix
something" when it comes to test failures on a branch (requires a stable
green test board to be viable)
3. Clearer merge criteria and potentially updates to circle config for
committers in terms of "which test suites need to be run" (notably,
including upgrade tests)
4. Integration of model-based and property-based fuzz testing into at least
the release qualification pipeline (see the toy sketch just below this list)
5. Improvements in project dependency management, most notably the in-jvm
dtest APIs, and the release process around that
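To make #4 concrete (purely illustrative, and assuming a property-testing
library like QuickTheories is on the test classpath), a toy sketch of the
shape of such a test; the property itself is a stand-in, not a real
Cassandra model test:

    import static org.quicktheories.QuickTheory.qt;
    import static org.quicktheories.generators.SourceDSL.integers;

    import org.junit.Test;

    public class ExamplePropertyTest
    {
        @Test
        public void minIsCommutative()
        {
            // The library generates many random inputs and shrinks any failing case.
            qt().forAll(integers().all(), integers().all())
                .check((a, b) -> Math.min(a, b) == Math.min(b, a));
        }
    }

The real value for release qualification would come from pointing generators
at actual state machines (schema changes, repair, upgrades) rather than a
property this trivial.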

So a) Am I missing anything, and b) Am I getting anything wrong in the
summary above?

On Thu, Nov 4, 2021 at 9:01 AM Andrés de la Peña <adelap...@apache.org>
wrote:

> Hi all,
>
> > we already have a way to confirm flakiness on circle by running the test
> > repeatedly N times. Like 100 or 500. That has proven to work very well
> > so far, at least for me. #collaborating #justfyi
>
>
> I think it would be helpful if we always ran the repeated test jobs at
> CircleCI when we add a new test or modify an existing one. Running those
> jobs, when applicable, could be a requirement before committing. This
> wouldn't help us when the changes affect many different tests or we are not
> able to identify the tests affected by our changes, but I think it could
> have prevented many of the recently fixed flakies.
>
>
> On Thu, 4 Nov 2021 at 12:24, Joshua McKenzie <jmcken...@apache.org> wrote:
>
> > >
> > > we noticed CI going from a
> > > steady 3-ish failures to many and it's getting fixed. So we're moving in
> > > the right direction imo.
> > >
> > An observation about this: there's tooling and technology widely in use to
> > help prevent ever getting into this state (to Benedict's point: blocking
> > merge on CI failure, or nightly tests and reverting regression commits,
> > etc). I think there's significant time and energy savings for us in using
> > automation to be proactive about the quality of our test boards rather than
> > reactive.
> >
> > I 100% agree that it's heartening to see that the quality of the codebase
> > is improving as is the discipline / attentiveness of our collective
> > culture. That said, I believe we still have a pretty fragile system when it
> > comes to test failure accumulation.
> >
> > On Thu, Nov 4, 2021 at 2:46 AM Berenguer Blasi <berenguerbl...@gmail.com>
> > wrote:
> >
> > > I agree with David. CI has been pretty reliable besides the random
> > > jenkins going down or timeout. The same 3 or 4 tests were the only flaky
> > > ones in jenkins and Circle was very green. I bisected a couple failures
> > > to legit code errors, David is fixing some more, others have as well, etc.
> > >
> > > It is good news imo as we're just getting to learn that our CI post 4.0
> > > is reliable, and we need to start treating it as such and paying
> > > attention to its reports. Not perfect, but reliable enough that it would
> > > have prevented those bugs getting merged.
> > >
> > > In fact we're having this conversation bc we noticed CI going from a
> > > steady 3-ish failures to many and it's getting fixed. So we're moving in
> > > the right direction imo.
> > >
> > > On 3/11/21 19:25, David Capwell wrote:
> > > >> It’s hard to gate commit on a clean CI run when there’s flaky tests
> > > > I agree, this is also why so much effort was done in 4.0 release to
> > > remove as much as possible.  Just over 1 month ago we were not really
> > > having a flaky test issue (outside of the sporadic timeout issues; my
> > > circle ci runs were green constantly), and now the “flaky tests” I see are
> > > all actual bugs (been root causing 2 out of the 3 I reported) and some (not
> > > all) of the flakiness was triggered by recent changes in the past month.
> > > >
> > > > Right now people do not believe the failing test is caused by their
> > > patch and attribute it to flakiness, which then causes the builds to start
> > > being flaky, which then leads to a different author coming to fix the
> > > issue; this behavior is what I would love to see go away.  If we find a
> > > flaky test, we should do the following
> > > >
> > > > 1) has it already been reported and who is working to fix?  Can we block
> > > this patch on the test being fixed?  Flaky tests due to timing issues
> > > normally are resolved very quickly, real bugs take longer.
> > > > 2) if not reported, why?  If you are the first to see this issue then
> > > there's a good chance the patch caused the issue, so you should root
> > > cause it.  If you are not the first to see it, why did others not report
> > > it (we tend to be good about this, even to the point Brandon has to mark
> > > the new tickets as dups…)?
> > > >
> > > > I have committed when there was flakiness, and I have caused flakiness;
> > > not saying I am perfect or that I do the above, just saying that if we
> > > all moved to the above model we could start relying on CI.  The biggest
> > > impact to our stability is people actually root causing flaky tests.
> > > >
> > > >>  I think we're going to need a system that
> > > >> understands the difference between success, failure, and timeouts
> > > >
> > > > I am curious how this system can know that the timeout is not an
> > > actual failure.  There was a bug in 4.0 with time serialization in
> > > message, which would cause the message to get dropped; this presented
> > > itself as a timeout if I remember properly (Jon Meredith or Yifan Cai
> > > fixed this bug I believe).
> > > >
> > > >> On Nov 3, 2021, at 10:56 AM, Brandon Williams <dri...@gmail.com> wrote:
> > > >>
> > > >> On Wed, Nov 3, 2021 at 12:35 PM bened...@apache.org <bened...@apache.org> wrote:
> > > >>> The largest number of test failures turn out (as pointed out by David)
> > > to be due to how arcane it was to trigger the full test suite. Hopefully
> > > we can get on top of that, but I think a significant remaining issue is a lack
> > > of trust in the output of CI. It’s hard to gate commit on a clean CI run
> > > when there’s flaky tests, and it doesn’t take much to misattribute one
> > > failing test to the existing flakiness (I tend to compare to a run of the
> > > trunk baseline for comparison, but this is burdensome and still error
> > > prone). The more flaky tests there are the more likely this is.
> > > >>>
> > > >>> This is in my opinion the real cost of flaky tests, and it’s probably
> > > worth trying to crack down on them hard if we can. It’s possible the
> > > Simulator may help here, when I finally finish it up, as we can port flaky
> > > tests to run with the Simulator and the failing seed can then be explored
> > > deterministically (all being well).
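(Aside: a generic, non-Simulator version of the "failing seed" idea, with
made-up names, for tests that already use randomness: print the seed, and
let a reproduction run pin it via a system property.)

    import java.util.Random;

    import org.junit.Test;

    public class SeededFuzzTest
    {
        // Reproduce a failure with -Dtest.seed=<seed printed by the failing run>.
        private static final long SEED = Long.getLong("test.seed", System.nanoTime());

        @Test
        public void fuzz()
        {
            System.out.println("test.seed=" + SEED);
            Random random = new Random(SEED);
            for (int i = 0; i < 1000; i++)
            {
                int input = random.nextInt();
                // drive the code under test with 'input'; same seed => same sequence of inputs
            }
        }
    }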
> > > >> I totally agree that the lack of trust is a driving problem here, even
> > > >> in knowing which CI system to rely on. When Jenkins broke but Circle
> > > >> was fine, we all assumed it was a problem with Jenkins, right up until
> > > >> Circle also broke.
> > > >>
> > > >> In testing a distributed system like this I think we're always going
> > > >> to have failures, even on non-flaky tests, simply because the
> > > >> underlying infrastructure is variable with transient failures of its
> > > >> own (the network is reliable!)  We can fix the flakies where the fault
> > > >> is in the code (and we've done this to many already) but to get more
> > > >> trustworthy output, I think we're going to need a system that
> > > >> understands the difference between success, failure, and timeouts, and
> > > >> in the latter case knows how to at least mark them differently.
> > > >> Simulator may help, as do the in-jvm dtests, but there is ultimately
> > > >> no way to cover everything without doing some things the hard, more
> > > >> realistic way where sometimes shit happens, marring the almost-perfect
> > > >> runs with noisy doubt, which then has to be sifted through to
> > > >> determine if there was a real issue.
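(One cheap way to "mark them differently", sketched with made-up names: wrap
the standard JUnit timeout so a timed-out test surfaces as an assumption
failure rather than a plain failure. Whether hiding timeouts like this is
wise is exactly David's question above, so it only helps if something still
tracks that category.)

    import java.util.concurrent.TimeUnit;

    import org.junit.AssumptionViolatedException;
    import org.junit.rules.TestRule;
    import org.junit.rules.Timeout;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;
    import org.junit.runners.model.TestTimedOutException;

    /** Applies a timeout, but reports timeouts as assumption failures, not test failures. */
    public class TimeoutAsSkip implements TestRule
    {
        private final Timeout timeout = new Timeout(2, TimeUnit.MINUTES);

        @Override
        public Statement apply(Statement base, Description description)
        {
            Statement timed = timeout.apply(base, description);
            return new Statement()
            {
                @Override
                public void evaluate() throws Throwable
                {
                    try
                    {
                        timed.evaluate();
                    }
                    catch (TestTimedOutException e)
                    {
                        // Timed out: surface as a distinct category instead of a hard failure.
                        throw new AssumptionViolatedException("timed out: " + description, e);
                    }
                }
            };
        }
    }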