Thanks Andrew.
Gijs, if you'd like to see the notes we took in PDX on this topic,
they're here: https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges
Feel free to add more ideas and comments. We're currently working on
our Q1 plan and will see how many of these things we can fit in then.
Jonathan
On 12/9/2014 6:24 AM, Andrew Halberstadt wrote:
We had a session on intermittents in PDX. Additionally we (the ateam)
have had several brainstorming sessions prior to the work week. I'll
try to summarize what we talked about and answer your questions at the
same time in-line.
On 08/12/14 03:52 PM, Gijs Kruitbosch wrote:
1) make it easier to figure out from bugzilla/treeherder when and where
the failure first occurred
- I don't want to know the first thing that got reported to bmo - IME,
that is not always the first time it happened, just the first time it
got filed.
In other words, can I query treeherder in some way (we have structured
logs now right, and all this stuff is in a DB somewhere?) with a test
name and a regex, to have it tell me where the test first failed with a
message matching that regex?
Structured logs have been around for a few months now, but only
recently has mozharness started using them for determining failure
status (and even now only for a few suites).
The next step is absolutely storing this stuff into a DB. Starting now
and into Q1 we'll be creating a prototype to figure out things like
schemas, costs and logistics. Unlike logs, we want to keep this data
forever, so we need to make sure we get it right.
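To make the "test name plus regex" question concrete, here is a minimal sketch of the kind of lookup such a DB could support. The schema is not decided yet, so the record layout (`push_id`, `test`, `status`, `message`) and the function name are assumptions for illustration only, and the data is made up.

```python
import re

# Hypothetical result records; field names are assumptions, not a real schema.
RESULTS = [
    {"push_id": 101, "test": "browser_foo.js", "status": "PASS", "message": ""},
    {"push_id": 102, "test": "browser_foo.js", "status": "FAIL",
     "message": "Assertion count mismatch"},
    {"push_id": 103, "test": "browser_foo.js", "status": "FAIL",
     "message": "Test timed out after 45s"},
    {"push_id": 105, "test": "browser_foo.js", "status": "FAIL",
     "message": "Test timed out after 45s"},
    {"push_id": 104, "test": "browser_bar.js", "status": "FAIL",
     "message": "Test timed out after 45s"},
]


def first_matching_failure(results, test_name, pattern):
    """Return the earliest non-passing record for test_name whose message
    matches pattern, or None if there is no such record.

    "Earliest" here means the lowest push_id, assuming push ids increase
    with time.
    """
    regex = re.compile(pattern)
    matches = [
        r for r in results
        if r["test"] == test_name
        and r["status"] != "PASS"
        and regex.search(r["message"])
    ]
    return min(matches, key=lambda r: r["push_id"], default=None)


first = first_matching_failure(RESULTS, "browser_foo.js", r"timed out")
print(first["push_id"])  # 103: push 102 failed too, but with a different message
```

The point of the regex is visible in the example: the test first failed on push 102, but the failure that matches the message of interest first appears on push 103.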
As part of the prototype phase, we plan to answer some simple
questions that don't require lots of historical data. Can we identify
new flaky tests? Can we normalize chunks based on runtime instead of
number of tests?
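As an illustration of what runtime-based chunking could look like, here is a small sketch that balances chunks by total runtime using a greedy "longest test first" heuristic. This is one possible approach, not a description of what the harnesses do; the function name and the runtimes are hypothetical.

```python
import heapq


def chunk_by_runtime(runtimes, num_chunks):
    """Split tests into num_chunks lists with roughly equal total runtime.

    runtimes maps test name -> runtime in seconds. Tests are assigned
    longest first, each one going to the chunk with the smallest total
    so far. This is a heuristic and is not guaranteed to be optimal.
    """
    # Heap entries are (total_runtime_so_far, chunk_index).
    heap = [(0, i) for i in range(num_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(num_chunks)]

    # Sort by descending runtime, then by name to keep the result stable.
    ordered = sorted(runtimes.items(), key=lambda kv: (-kv[1], kv[0]))
    for test, seconds in ordered:
        total, idx = heapq.heappop(heap)
        chunks[idx].append(test)
        heapq.heappush(heap, (total + seconds, idx))
    return chunks


# Made-up runtimes: one slow test and three faster ones.
print(chunk_by_runtime({"a": 60, "b": 30, "c": 20, "d": 10}, 2))
# [['a'], ['b', 'c', 'd']]
```

Splitting the same four tests by count would give two chunks of two tests with 90s and 30s of work; splitting by runtime gives 60s each.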
2) make it easier to figure out from bugzilla/treeherder when and where
the failure happens
3) numbers on how frequently a test fails
I think these both tie into number 1. We aren't sure exactly what the
schema will look like, but tying metadata about the test run into the
results is obviously something we need to do. These questions would
become easy to answer.
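For example, once run metadata is attached to each result, "where does it fail and how often" reduces to a simple grouping. The sketch below assumes hypothetical `test`, `platform` and `status` fields; the real schema may look quite different.

```python
from collections import defaultdict


def failure_rates(results):
    """Return {(test, platform): fraction of runs that did not pass}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total runs]
    for r in results:
        key = (r["test"], r["platform"])
        counts[key][1] += 1
        if r["status"] != "PASS":
            counts[key][0] += 1
    return {key: fails / runs for key, (fails, runs) in counts.items()}
```

Grouping by other metadata (build type, chunk, machine pool) would be the same pattern with a different key.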
We also want to look into cross-correlating data from other systems
(e.g. bugzilla, orangefactor, ...) with test results. This will likely
be further out though.
4) automate regression hunting (aka mozregression for intermittent
infra-only failures)
Yes, this is explicitly one of the first things we'll be tackling.
Sheriffs often don't have time to go and retrigger or backfill jobs,
and they shouldn't have to. This depends only loosely on the DB
project outlined above.
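A rough sketch of why this needs retriggers rather than a plain bisection: a single green run proves little for an intermittent, so each candidate push has to be run enough times to be reasonably sure it is good. The code below assumes failures are independent and that the failure rate is roughly known; `run_test` is a hypothetical callable standing in for "trigger the job on this push and report whether it passed".

```python
import math


def retriggers_needed(failure_rate, confidence=0.95):
    """Number of runs so that a bad push fails at least once with the
    given probability, assuming independent runs."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def find_first_bad_push(pushes, run_test, failure_rate, confidence=0.95):
    """Bisect pushes (oldest first, last one known bad) for the first push
    on which the intermittent failure shows up.

    run_test(push) must return True if the run passed. A push is treated
    as good only if all retriggers pass, so a wrong "good" verdict is
    still possible with probability up to 1 - confidence per push.
    """
    runs = retriggers_needed(failure_rate, confidence)

    def is_bad(push):
        # Stops at the first failing run.
        return any(not run_test(push) for _ in range(runs))

    lo, hi = 0, len(pushes) - 1  # invariant: pushes[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(pushes[mid]):
            hi = mid
        else:
            lo = mid + 1
    return pushes[lo]


print(retriggers_needed(0.1))  # 29 runs for a test that fails 10% of the time
```

The numbers show the cost: a test that fails one run in ten needs 29 green runs per push before calling it good at 95% confidence, which is exactly the kind of work that should be automated rather than done by hand.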
5) rr or similar recording of failing test runs
We've talked about this before on this newsgroup, but it's been a long
time. Is this feasible and/or currently in the pipeline?
We're aware of rr, but it's not something that has been called out as
something we should do in the short term. My understanding is that
there are still a lot of unknowns, and getting something stood up in
production infrastructure will likely be a large multi-quarter
project. Maybe :roc can clarify here.
I'm not saying we won't do it (it would be awesome), but it seems like
there are easier wins we can make in the meantime.
~ Gijs
Other things that we talked about that might make dealing with
intermittents better:
* dynamic (maybe also static) analysis of new tests to determine
common bad patterns (ehsan has ideas) to be integrated into autoland
or a post-commit hook or some kind of quarantine.
* in-tree chunking/more dynamic test scheduling (ability to schedule
only certain tests). One of the end goals here is for the term
"chunking" to disappear from the point of view of developers.
* C++ code coverage tied into the build system with automatically
updated reports (I'm working on the build integration pieces on the
side).
* automatic filing of intermittents (this is currently what the
sheriffs spend the most time on; fixing it would free them up to
better monitor the tree).
Thanks for caring about the state of intermittents; they've been
neglected for too long. I'm hopeful that 2015 will bring many
improvements in this area. And of course, please let us know if you
have any other ideas or would like to help out.
-Andrew
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform