Thanks Andrew.

Gijs, if you'd like to see the notes we took in PDX on this topic, they're here: https://etherpad.mozilla.org/ateam-pdx-intermittent-oranges

Feel free to add more ideas and comments. We're currently working on our Q1 plan and will see how many of these things we can fit in then.

Jonathan

On 12/9/2014 6:24 AM, Andrew Halberstadt wrote:
We had a session on intermittents in PDX. Additionally, we (the ateam) had several brainstorming sessions prior to the work week. I'll try to summarize what we talked about and answer your questions inline at the same time.

On 08/12/14 03:52 PM, Gijs Kruitbosch wrote:
1) make it easier to figure out from bugzilla/treeherder when and where
the failure first occurred
- I don't want to know the first thing that got reported to bmo - IME,
that is not always the first time it happened, just the first time it
got filed.

In other words, can I query treeherder in some way (we have structured
logs now right, and all this stuff is in a DB somewhere?) with a test
name and a regex, to have it tell me where the test first failed with a
message matching that regex?

Structured logs have been around for a few months now, but only recently has mozharness started using them for determining failure status (and even now only for a few suites).

The next step is absolutely to store this stuff in a DB. Starting now and into Q1, we'll be creating a prototype to figure out things like schemas, costs, and logistics. Unlike logs, we want to keep this data forever, so we need to make sure we get it right.

As part of the prototype phase, we plan to answer some simple questions that don't require lots of historical data. Can we identify new flaky tests? Can we normalize chunks based on runtime instead of number of tests?
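To make that second question concrete, here's a rough sketch of what runtime-based chunking could look like. This isn't any existing tool or the planned implementation, just an illustration of greedy assignment by recorded runtime instead of test count; the test names and numbers are made up:

    # Hypothetical sketch: split tests into chunks balanced by recorded runtime
    # rather than by test count, using greedy "longest runtime first" assignment.

    def chunk_by_runtime(test_runtimes, num_chunks):
        """test_runtimes: dict mapping test name -> average runtime in seconds."""
        chunks = [{"tests": [], "total": 0.0} for _ in range(num_chunks)]
        # Assign the longest tests first, each to the currently lightest chunk.
        for test, runtime in sorted(test_runtimes.items(),
                                    key=lambda item: item[1], reverse=True):
            lightest = min(chunks, key=lambda c: c["total"])
            lightest["tests"].append(test)
            lightest["total"] += runtime
        return chunks

    # Example: three test files split into two chunks of roughly equal runtime.
    example = {"test_a.html": 120.0, "test_b.html": 90.0, "test_c.html": 45.0}
    for i, chunk in enumerate(chunk_by_runtime(example, 2), start=1):
        print(i, chunk["total"], chunk["tests"])

The point is that once per-test runtimes live in a DB, rebalancing chunks becomes a query plus a loop rather than hand-tuned manifest splits.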


2) make it easier to figure out from bugzilla/treeherder when and where
the failure happens

3) numbers on how frequently a test fails

I think these both tie into number 1. We aren't sure exactly what the schema will look like, but tying metadata about the test run into the results is obviously something we need to do. These questions would become easy to answer.
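Purely for illustration (the real schema is still being prototyped, and every table and column name below is made up), the kind of query Gijs is asking for in (1) might end up looking something like this:

    # Hypothetical sketch of the query we'd like to support once results are in
    # a DB: "when did test X first fail with a message matching this regex?"
    import re
    import sqlite3

    conn = sqlite3.connect(":memory:")  # placeholder database
    conn.execute("""
        CREATE TABLE failures (
            test_name TEXT,
            message   TEXT,
            revision  TEXT,
            push_time TEXT
        )
    """)
    # A made-up failure record, just so the example returns something.
    conn.execute("INSERT INTO failures VALUES (?, ?, ?, ?)",
                 ("browser_test_foo.js", "Timed out waiting for panel",
                  "abcdef123456", "2014-11-20T10:15:00"))

    def first_matching_failure(conn, test_name, pattern):
        """Earliest failure of test_name whose message matches pattern."""
        regex = re.compile(pattern)
        rows = conn.execute(
            "SELECT message, revision, push_time FROM failures "
            "WHERE test_name = ? ORDER BY push_time ASC",
            (test_name,),
        )
        for message, revision, push_time in rows:
            if regex.search(message):
                return revision, push_time
        return None

    print(first_matching_failure(conn, "browser_test_foo.js", r"Timed out.*panel"))

Whether this ends up as SQL, an ActiveData-style service, or something else is exactly what the prototype phase is meant to figure out.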

We also want to look into cross-correlating data from other systems (e.g. bugzilla, orangefactor, ...) with test results. This will likely be further out, though.


4) automate regression hunting (aka mozregression for intermittent
infra-only failures)

Yes, this is explicitly one of the first things we'll be tackling. Sheriffs often don't have time to go and retrigger backfills, and they shouldn't have to. This loosely depends on (though doesn't strictly require) the DB project outlined above.
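As a back-of-the-envelope illustration of why this needs automation (my numbers, not a spec for the tool): to bisect an intermittent you have to retrigger each candidate push enough times that a green result actually means the regressing change isn't there yet.

    # Rough sketch: if a test fails intermittently with probability p per run,
    # how many retriggers per push keep the chance of wrongly calling an
    # affected push "green" below miss_rate?
    import math

    def retriggers_needed(p, miss_rate=0.05):
        """Smallest n such that (1 - p)**n <= miss_rate."""
        return math.ceil(math.log(miss_rate) / math.log(1.0 - p))

    # A 10% intermittent needs ~29 runs per push for 95% confidence;
    # a 2% intermittent needs ~149. Not something sheriffs should do by hand.
    for rate in (0.10, 0.05, 0.02):
        print(rate, retriggers_needed(rate))

Those retrigger counts multiply across every push in the regression range, which is exactly the kind of grind a machine should be doing for us.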


5) rr or similar recording of failing test runs

We've talked about this before on this newsgroup, but it's been a long
time. Is this feasible and/or currently in the pipeline?

We're aware of rr, but it hasn't been called out as something we should tackle in the short term. My understanding is that there are still a lot of unknowns, and getting it stood up in production infrastructure will likely be a large multi-quarter project. Maybe :roc can clarify here.

I'm not saying we won't do it (it would be awesome), but it seems like there are easier wins we can make in the meantime.


~ Gijs

Other things that we talked about that might make dealing with intermittents better:

* dynamic (maybe also static) analysis of new tests to detect common bad patterns (ehsan has ideas), to be integrated into autoland, a post-commit hook, or some kind of quarantine.

* in-tree chunking/more dynamic test scheduling (ability to schedule only certain tests). One of the end goals here is for the term "chunking" to disappear from the point of view of developers.

* C++ code coverage tied into the build system, with automatically updated reports (I'm working on the build integration pieces on the side).

* automatic filing of intermittents (this is currently what the sheriffs spend the most time on; fixing it would free them up to better monitor the tree).

Thanks for caring about the state of intermittents; they've been neglected for too long. I'm hopeful that 2015 will bring many improvements in this area. And of course, please let us know if you have any other ideas or would like to help out.

-Andrew
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
