Re: File JIRAs for all flaky test failures

Saikat Kanjilal Thu, 16 Feb 2017 08:34:42 -0800

Reynold,

It's not one issue. I encountered multiple issues (stack traces, exceptions, etc.) 
where the failure only occurred on Jenkins but not in my local environments. I 
would have to dig up all those old unit tests to list them all 😊 and I'm not 
willing to do that unless we deem this to be an actual problem that we want to 
spend time and energy to fix.

Thanks

________________________________
From: Reynold Xin <r...@databricks.com>
Sent: Thursday, February 16, 2017 8:27 AM
To: Saikat Kanjilal
Cc: dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

What exactly is the issue? I've been working on Spark dev for a long time and 
very rarely do I actually run into an issue that only manifests on Jenkins but 
not locally. I don't have some magic local setup either.

We should definitely cut down test flakiness.

On Thu, Feb 16, 2017 at 5:26 PM, Saikat Kanjilal 
<sxk1...@hotmail.com> wrote:

I'd just like to follow up again on this thread: should we devote some energy 
to fixing unit tests module by module? There wasn't much interest in this last 
time, but given the nature of this thread I'd be willing to deep dive into this 
again with some help.

________________________________
From: Saikat Kanjilal <sxk1...@hotmail.com>
Sent: Wednesday, February 15, 2017 6:12 PM
To: Josh Rosen
Cc: Armin Braun; Kay Ousterhout; dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

The issue was not a lack of tooling. I used the URL you are describing 
below to drill down to the exact test failure/stack trace. The problem was that 
my builds would work like a charm locally but fail with these errors on 
Jenkins. This was the whole challenge in fixing the unit tests: I was rarely 
(if ever) able to replicate the test failures locally.

Sent from my iPhone

On Feb 15, 2017, at 5:40 PM, Josh Rosen 
<joshro...@databricks.com> wrote:

A useful tool for investigating test flakiness is my Jenkins Test Explorer 
service, running at https://spark-tests.appspot.com/

This has some useful timeline views for debugging flaky builds. For instance, 
at https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may 
be slow to load) you can see this chart: https://i.imgur.com/j8LV3pX.png. Here, 
each column represents a test run and each row represents a test which failed 
at least once over the displayed time period.

In that linked example screenshot you'll notice that a few columns have grey 
squares indicating that tests were skipped but lack any red squares to indicate 
test failures. This usually indicates that the build failed due to a problem 
other than an individual test failure. For example, I clicked into one of those 
builds and found that one test suite failed in test setup because the previous 
suite had not properly cleaned up its SparkContext (I'll file a JIRA for this).
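
To make the layout of that chart concrete, here is a minimal sketch of how such a grid could be built. It is not the Test Explorer's code: the input shape (one dict per run mapping test name to a status string) and the names `failure_grid` and `runs_with_only_skips` are assumptions made up for illustration.

```python
def failure_grid(runs):
    """Build one row per test that failed at least once.

    `runs` is a hypothetical input: a list (oldest first) of dicts mapping
    a test name to "passed", "failed" or "skipped".
    Each row is a string with one character per run (column).
    """
    rows = sorted({t for run in runs for t, s in run.items() if s == "failed"})
    symbols = {"failed": "X", "skipped": "-", "passed": "."}
    # A blank cell means the test did not appear in that run at all.
    return {t: "".join(symbols.get(run.get(t), " ") for run in runs) for t in rows}


def runs_with_only_skips(runs):
    """Indexes of runs that skipped tests but recorded no failed test."""
    return [
        i
        for i, run in enumerate(runs)
        if "skipped" in run.values() and "failed" not in run.values()
    ]
```

Columns returned by `runs_with_only_skips` correspond to the grey-only columns described above, i.e. builds that probably broke for a reason other than an individual test failure.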

You can click through the interface to drill down to reports on individual 
builds, tests, suites, etc. As an example of an individual test's detail page, 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
shows the patterns of flakiness in a streaming checkpoint test.

Finally, there's an experimental "interesting new test failures" report which 
tries to surface tests which have started failing very recently: 
https://spark-tests.appspot.com/failed-tests/new. Specifically, entries in this 
feed are test failures which a) occurred in the last week, b) were not part of 
a build which had 20 or more failed tests, c) were not observed to fail 
during the previous week (i.e. no failures from [2 weeks ago, 1 week ago)), and 
d) represent the first time that the test failed this week (i.e. a test 
case will appear at most once in the results list). I've also exposed this as 
an RSS feed at https://spark-tests.appspot.com/rss/failed-tests/new.
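
The four criteria above can be sketched as a small filter. This is only one reading of the rules, not the service's actual implementation: the `Failure` record, the `new_failures` function and its parameters are hypothetical names, and criterion d) is interpreted as "keep the earliest failure that passes a) to c)".

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple


class Failure(NamedTuple):
    test: str       # fully qualified test name
    build: str      # build identifier
    when: datetime  # time the build ran


def new_failures(failures, now, max_failed_per_build=20):
    """Return at most one "interesting" failure per test."""
    week = timedelta(days=7)
    failed_per_build = Counter(f.build for f in failures)
    # Tests that failed in the half-open window [2 weeks ago, 1 week ago).
    failed_previous_week = {
        f.test for f in failures if now - 2 * week <= f.when < now - week
    }
    first_seen = {}
    for f in sorted(failures, key=lambda f: f.when):
        if not (now - week <= f.when <= now):                  # a) last week only
            continue
        if failed_per_build[f.build] >= max_failed_per_build:  # b) skip mass failures
            continue
        if f.test in failed_previous_week:                     # c) not already flaky
            continue
        first_seen.setdefault(f.test, f)                       # d) first occurrence only
    return list(first_seen.values())
```

Rule b) is what keeps a single broken build (for example, one that fails during setup and takes hundreds of tests with it) from flooding the feed.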

On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal 
<sxk1...@hotmail.com> wrote:

I would recommend we just open JIRAs for unit tests based on module 
(core/ml/sql etc.) and fix this one module at a time. This at least keeps the 
number of unit tests needing fixing down to a manageable number.

________________________________
From: Armin Braun <m...@obrown.io>
Sent: Wednesday, February 15, 2017 12:48 PM
To: Saikat Kanjilal
Cc: Kay Ousterhout; dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

I think one thing that is contributing to this a lot too is the general issue 
of the tests taking up a lot of file descriptors (10k+ if I run them on a 
standard Debian machine).
There are a few suites that contribute to this in particular, like 
`org.apache.spark.ExecutorAllocationManagerSuite` which, like a few others, 
appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce the 
resource consumption of these tests?
Seems to me these can cause a lot of unpredictable behavior (making the reason 
for flaky tests hard to identify, especially when there are timeouts etc. 
involved), and they make it prohibitively expensive for many to test locally imo.
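
One rough way to put numbers on the file descriptor usage is to count the entries under `/proc/<pid>/fd` for the process running the tests. The sketch below assumes Linux (it relies on procfs) and that you are allowed to read that directory for the target process; `open_fd_count` is a hypothetical helper name, not an existing tool.

```python
import os


def open_fd_count(pid):
    """Count the open file descriptors of process `pid` (Linux procfs only).

    When `pid` is the current process, the count includes the descriptor
    briefly opened to list the directory, so compare differences between
    two calls rather than relying on the absolute number.
    """
    return len(os.listdir("/proc/%d/fd" % pid))
```

Sampling this before and after a suite, with the pid of the test JVM, would give a per-suite delta that could go into a JIRA.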

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal 
<sxk1...@hotmail.com> wrote:

I was working on something to address this a while ago 
(https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty in testing 
locally made things a lot more complicated to fix for each of the unit tests. 
Should we resurface this JIRA? I wholeheartedly agree with the 
flakiness assessment of the unit tests.

[SPARK-9487] Use the same num. worker threads in Scala ... 
<https://issues.apache.org/jira/browse/SPARK-9487>
In Python we use `local[4]` for unit tests, while in Scala/Java we use 
`local[2]` and `local` for some unit tests in SQL, MLLib, and other components. 
If the ...

________________________________
From: Kay Ousterhout <kayousterh...@gmail.com>
Sent: Wednesday, February 15, 2017 12:10 PM
To: dev@spark.apache.org
Subject: File JIRAs for all flaky test failures

Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more common 
than not now that the tests need to be re-run at least once on PRs before they 
pass.  This is both annoying and problematic because it makes it harder to tell 
when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins fails 
on a PR (for a reason unrelated to the PR).  Just provide a quick description 
of the failure -- e.g., "Flaky test: DAGSchedulerSuite" or "Tests failed 
because 250m timeout expired", a link to the failed build, and include the 
"Tests" component.  If there's already a JIRA for the issue, just comment with 
a link to the latest failure.  I know folks don't always have time to track 
down why a test failed, but this is at least helpful to someone else who, later 
on, is trying to diagnose when the issue started to find the problematic code / 
test.

If this seems like too high overhead, feel free to suggest alternative ways to 
make the tests less flaky!

-Kay
