Won't running the flaky tests with forkEvery 1 extend precheckin by a fair bit? I'd prefer to see two separate builds running.
On Tue, Apr 26, 2016 at 11:53 AM, Kirk Lund <[email protected]> wrote:

> I'm in favor of running the FlakyTests together at the end of precheckin,
> using forkEvery 1 on them too.
>
> What about running two nightly builds? One that runs all the non-flaky
> UnitTests, IntegrationTests and DistributedTests, plus another nightly
> build that runs only FlakyTests? We can run Jenkins jobs on our local
> machines that separate FlakyTests out into their own job too, but I'd like
> to see the main nightly build go to 100% green (if that's even possible
> without encountering many more flickering tests).
>
> -Kirk
>
> On Tue, Apr 26, 2016 at 11:02 AM, Dan Smith <[email protected]> wrote:
>
>> +1 for separating these out and running them with forkEvery 1.
>>
>> I think they should probably still run as part of precheckin and the
>> nightly builds though. We don't want this to turn into essentially
>> disabling and ignoring these tests.
>>
>> -Dan
>>
>> On Tue, Apr 26, 2016 at 10:28 AM, Kirk Lund <[email protected]> wrote:
>>
>>> Also, I don't think there's much value in continuing to use the "CI"
>>> label. If a test fails in Jenkins, run the test to see if it fails
>>> consistently. If it doesn't, it's flaky. The developer looking at it
>>> should try to determine the cause of the failure (e.g., "it uses thread
>>> sleeps or random ports with BindExceptions, or has short timeouts with a
>>> probable GC pause") and include that info when adding the FlakyTest
>>> annotation and filing a Jira bug with the Flaky label. If the test fails
>>> consistently, then file a Jira bug without the Flaky label.
>>>
>>> -Kirk
>>>
>>> On Tue, Apr 26, 2016 at 10:24 AM, Kirk Lund <[email protected]> wrote:
>>>
>>>> There are quite a few test classes that have multiple test methods
>>>> annotated with the FlakyTest category.
>>>>
>>>> More thoughts:
>>>>
>>>> In general, I think that if any given test fails intermittently then it
>>>> is a FlakyTest. A good test should either pass or fail consistently.
>>>> After annotating a test method with FlakyTest, the developer should
>>>> then add the Flaky label to the corresponding Jira ticket. What we then
>>>> do with the Jira tickets (i.e., fix them) is probably more important
>>>> than deciding whether a test is flaky or not.
>>>>
>>>> Rather than try to come up with some flaky process for determining
>>>> whether a given test is flaky (e.g., "does it have thread sleeps?"), it
>>>> would be better to have a wiki page with examples of flakiness and how
>>>> to fix them ("if the test has thread sleeps, then switch to using
>>>> Awaitility and do this...").
>>>>
>>>> -Kirk
>>>>
>>>> On Mon, Apr 25, 2016 at 10:51 PM, Anthony Baker <[email protected]> wrote:
>>>>
>>>>> Thanks Kirk!
>>>>>
>>>>> ~/code/incubator-geode (develop)$ grep -ro "FlakyTest.class" . | grep -v Binary | wc -l | xargs echo "Flake factor:"
>>>>> Flake factor: 136
>>>>>
>>>>> Anthony
>>>>>
>>>>> On Apr 25, 2016, at 9:45 PM, William Markito <[email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Are we also planning to automate the additional build task somehow?
>>>>>>
>>>>>> I'd also suggest creating a wiki page with some stats (like how many
>>>>>> FlakyTests we currently have) and the idea behind this effort, so we
>>>>>> can keep track and see how it's evolving over time.
>>>>>>
>>>>>> On Mon, Apr 25, 2016 at 6:54 PM, Kirk Lund <[email protected]> wrote:
>>>>>>
>>>>>>> After completing GEODE-1233, all currently known flickering tests
>>>>>>> are now annotated with our FlakyTest JUnit Category.
>>>>>>>
>>>>>>> In an effort to divide our build up into multiple build pipelines
>>>>>>> that are sequential and dependable, we could consider excluding
>>>>>>> FlakyTests from the primary integrationTest and distributedTest
>>>>>>> tasks. An additional build task would then execute all of the
>>>>>>> FlakyTests separately. This would hopefully help us get to a point
>>>>>>> where we can depend on our primary testing tasks staying green 100%
>>>>>>> of the time. We would then prioritize fixing the FlakyTests and, one
>>>>>>> by one, removing the FlakyTest category from them.
>>>>>>>
>>>>>>> I would also suggest that we execute the FlakyTests with "forkEvery
>>>>>>> 1" to give each test a clean JVM or set of DistributedTest JVMs.
>>>>>>> That would hopefully decrease the chance of a GC pause or test
>>>>>>> pollution causing flickering failures.
>>>>>>>
>>>>>>> Having reviewed lots of test code and failure stacks, I believe that
>>>>>>> the primary causes of FlakyTests are timing sensitivity (thread
>>>>>>> sleeps, or nothing that waits for async activity, or timeouts and
>>>>>>> sleeps that are insufficient on a busy CPU, under heavy I/O, or
>>>>>>> during a GC pause) and random ports via AvailablePort (instead of
>>>>>>> using zero for an ephemeral port).
>>>>>>>
>>>>>>> Opinions or ideas? Hate it? Love it?
>>>>>>>
>>>>>>> -Kirk
>>>>>>
>>>>>> --
>>>>>> ~/William
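The build-task split and "forkEvery 1" idea discussed in the thread could be wired up in Gradle roughly as below. This is a minimal sketch, not the actual Geode build: the fully qualified category class name and the `flakyTest` task name are assumptions for illustration.

```groovy
// Sketch: exclude FlakyTest-annotated tests from the primary test task,
// and run them in a separate task with a fresh JVM per test class.
// The category class name below is an assumption, not Geode's real one.
test {
    useJUnit {
        excludeCategories 'com.gemstone.gemfire.test.junit.categories.FlakyTest'
    }
}

task flakyTest(type: Test) {
    useJUnit {
        includeCategories 'com.gemstone.gemfire.test.junit.categories.FlakyTest'
    }
    // "forkEvery 1": a clean JVM for each test class, reducing test
    // pollution and GC-pause interference between tests.
    forkEvery = 1
}
```

A nightly Jenkins job could then invoke `gradle test` and `gradle flakyTest` as two separate builds, matching the two-builds preference voiced at the top of the thread.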
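On the random-ports point: binding to port 0 asks the OS for a free ephemeral port atomically, avoiding the find-then-bind race (and the resulting BindExceptions) of an AvailablePort-style scan. A minimal stdlib-only illustration; the class and method names are invented for the example:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrates the "use zero for an ephemeral port" suggestion: the OS
// assigns a free port at bind time, so two tests can never pick the
// same "available" port and then race to bind it.
public class EphemeralPortExample {

    // Bind to port 0, report the OS-assigned port, then release it.
    static int osAssignedPort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("OS assigned ephemeral port " + osAssignedPort());
    }
}
```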
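On the thread-sleep point, where the thread suggests switching to Awaitility: the underlying pattern can be sketched with the standard library alone. This is not Awaitility's API, just the idea of polling a condition until a deadline instead of sleeping a fixed, guessed duration that may be too short on a busy machine:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Stdlib-only sketch of an Awaitility-style wait: poll a condition
// until it holds or the deadline passes, rather than Thread.sleep()ing
// a fixed amount and hoping async work has finished.
public class AwaitSketch {

    public static boolean await(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition met before the deadline
            }
            Thread.sleep(50); // short poll interval, not a guess at total latency
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        boolean ok = await(() -> System.currentTimeMillis() - start > 200, 5000);
        System.out.println(ok ? "condition met" : "timed out");
    }
}
```

The test stays fast when the condition is met quickly, and only consumes the full timeout in the failure case, which is exactly what a fixed sleep cannot do.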
