I'm not sure what you're specifically suggesting. Of course flaky tests are
bad and should be fixed, and people do fix them. Yes, some are pretty hard to
fix because they are rarely, if ever, reproducible. If you want to fix one,
fix it; there's nothing more to it.

I don't perceive flaky tests to be a significant problem. In my anecdotal
experience, flakiness has gone from bad to occasional over the past year.

On Thu, Feb 16, 2017 at 4:26 PM Saikat Kanjilal <sxk1...@hotmail.com> wrote:

> I'd just like to follow up again on this thread: should we devote some
> energy to fixing unit tests module by module? There wasn't much interest in
> this last time, but given the nature of this thread I'd be willing to deep
> dive into this again with some help.
> ------------------------------
> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
> *Sent:* Wednesday, February 15, 2017 6:12 PM
> *To:* Josh Rosen
> *Cc:* Armin Braun; Kay Ousterhout; dev@spark.apache.org
>
> *Subject:* Re: File JIRAs for all flaky test failures
> The issue was not a lack of tooling. I used the URL you describe below to
> drill down to the exact test failure/stack trace; the problem was that my
> builds would work like a charm locally but fail with these errors on
> Jenkins. That was the whole challenge in fixing the unit tests: it was rare
> (if ever) that I could replicate a test failure locally.
>
> Sent from my iPhone
>
> On Feb 15, 2017, at 5:40 PM, Josh Rosen <joshro...@databricks.com> wrote:
>
> A useful tool for investigating test flakiness is my Jenkins Test Explorer
> service, running at https://spark-tests.appspot.com/
>
> This has some useful timeline views for debugging flaky builds. For
> instance, at
> https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may
> be slow to load) you can see this chart: https://i.imgur.com/j8LV3pX.png.
> Here, each column represents a test run and each row represents a test
> which failed at least once over the displayed time period.
>
> In that linked example screenshot you'll notice that a few columns have
> grey squares indicating that tests were skipped but lack any red squares to
> indicate test failures. This usually indicates that the build failed due to
> a problem other than an individual test failure. For example, I clicked
> into one of those builds and found that one test suite failed in test setup
> because the previous suite had not properly cleaned up its SparkContext
> (I'll file a JIRA for this).
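> 
> As an aside, here is a minimal sketch of the cleanup pattern that avoids
> leaking a SparkContext from one suite into the next; the suite and app names
> below are made up for illustration, not the suite that actually failed:
> 
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.scalatest.{BeforeAndAfterAll, FunSuite}
> 
>     // Hypothetical suite showing the cleanup pattern only.
>     class ExampleCleanupSuite extends FunSuite with BeforeAndAfterAll {
>       private var sc: SparkContext = _
> 
>       override def beforeAll(): Unit = {
>         super.beforeAll()
>         sc = new SparkContext(
>           new SparkConf().setMaster("local[2]").setAppName("example-cleanup"))
>       }
> 
>       override def afterAll(): Unit = {
>         try {
>           if (sc != null) sc.stop()  // stop the context even if a test failed
>         } finally {
>           super.afterAll()
>         }
>       }
> 
>       test("trivial job") {
>         assert(sc.parallelize(1 to 10).count() === 10)
>       }
>     }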
>
> You can click through the interface to drill down to reports on individual
> builds, tests, suites, etc. As an example of an individual test's detail
> page,
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
> shows the patterns of flakiness in a local checkpoint test.
>
> Finally, there's an experimental "interesting new test failures" report
> which tries to surface tests which have started failing very recently:
> https://spark-tests.appspot.com/failed-tests/new. Specifically, entries
> in this feed are test failures which a) occurred in the last week, b) were
> not part of a build which had 20 or more failed tests, c) were not observed
> to fail during the previous week (i.e. no failures from [2 weeks ago, 1 week
> ago)), and d) represent the first time that the test failed this week (i.e.
> a test case will appear at most once in the results list). I've also exposed
> this as an RSS feed at
> https://spark-tests.appspot.com/rss/failed-tests/new.
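> 
> In code, that filter boils down to roughly the following. This is only an
> illustrative Scala sketch; the TestFailure record and its fields are
> assumptions for the example, not the service's actual schema:
> 
>     import java.time.Instant
>     import java.time.temporal.ChronoUnit
> 
>     // Hypothetical failure record; the real data model may differ.
>     case class TestFailure(testName: String, buildId: String,
>                            failedAt: Instant, failuresInBuild: Int)
> 
>     def interestingNewFailures(all: Seq[TestFailure], now: Instant): Seq[TestFailure] = {
>       val oneWeekAgo = now.minus(7, ChronoUnit.DAYS)
>       val twoWeeksAgo = now.minus(14, ChronoUnit.DAYS)
>       // (c) tests that also failed in [2 weeks ago, 1 week ago) are excluded
>       val failedPreviousWeek = all
>         .filter(f => !f.failedAt.isBefore(twoWeeksAgo) && f.failedAt.isBefore(oneWeekAgo))
>         .map(_.testName).toSet
>       all
>         .filter(_.failedAt.isAfter(oneWeekAgo))          // (a) failed within the last week
>         .filter(_.failuresInBuild < 20)                  // (b) not from a build with 20+ failures
>         .filterNot(f => failedPreviousWeek(f.testName))  // (c) no failures the week before
>         .groupBy(_.testName).values
>         .map(_.minBy(_.failedAt.toEpochMilli))           // (d) at most one entry per test
>         .toSeq
>     }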
>
>
> On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
> I would recommend we just open JIRAs for unit tests by module (core/ml/sql,
> etc.) and fix them one module at a time; this at least keeps the number of
> unit tests needing fixing down to a manageable number.
>
>
> ------------------------------
> *From:* Armin Braun <m...@obrown.io>
> *Sent:* Wednesday, February 15, 2017 12:48 PM
> *To:* Saikat Kanjilal
> *Cc:* Kay Ousterhout; dev@spark.apache.org
> *Subject:* Re: File JIRAs for all flaky test failures
>
> I think one thing that is contributing to this a lot too is the general
> issue of the tests taking up a lot of file descriptors (10k+ if I run them
> on a standard Debian machine).
> There are a few suites that contribute to this in particular, like
> `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others,
> appears to consume a lot of fds.
>
> Wouldn't it make sense to open JIRAs about those and actively try to
> reduce the resource consumption of these tests?
> Seems to me these can cause a lot of unpredictable behavior (making the
> reason for flaky tests hard to identify, especially when timeouts etc. are
> involved), plus they make it prohibitively expensive for many people to test
> locally, imo.
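> 
> For what it's worth, one rough way to track the open-descriptor count over a
> run is to ask the JVM itself. This is just a sketch (assumes a HotSpot JVM
> on Linux/macOS), not something the build currently does:
> 
>     import java.lang.management.ManagementFactory
>     import com.sun.management.UnixOperatingSystemMXBean
> 
>     // Number of file descriptors currently open by this JVM, or -1 if unknown.
>     def openFdCount(): Long = ManagementFactory.getOperatingSystemMXBean match {
>       case os: UnixOperatingSystemMXBean => os.getOpenFileDescriptorCount
>       case _ => -1L
>     }
> 
>     // e.g. log the count before and after a suspect suite runs:
>     println(s"open fds: ${openFdCount()}")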
>
> On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
> I was working on something to address this a while ago (
> https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty of
> testing locally made things a lot more complicated to fix for each of the
> unit tests. Should we resurface this JIRA again? I would wholeheartedly
> agree with the flakiness assessment of the unit tests.
>
>
> ------------------------------
> *From:* Kay Ousterhout <kayousterh...@gmail.com>
> *Sent:* Wednesday, February 15, 2017 12:10 PM
> *To:* dev@spark.apache.org
> *Subject:* File JIRAs for all flaky test failures
>
> Hi all,
>
> I've noticed the Spark tests getting increasingly flaky -- it seems more
> common than not now that the tests need to be re-run at least once on PRs
> before they pass.  This is both annoying and problematic because it makes
> it harder to tell when a PR is introducing new flakiness.
>
> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
> fails on a PR (for a reason unrelated to the PR).  Just provide a quick
> description of the failure -- e.g., "Flaky test: DAGSchedulerSuite" or
> "Tests failed because 250m timeout expired", a link to the failed build,
> and include the "Tests" component.  If there's already a JIRA for the
> issue, just comment with a link to the latest failure.  I know folks don't
> always have time to track down why a test failed, but this is at least
> helpful to someone else who, later on, is trying to find the problematic
> code / test by diagnosing when the issue started.
>
> If this seems like too high overhead, feel free to suggest alternative
> ways to make the tests less flaky!
>
> -Kay
>
>
>
