On Thursday, August 30, 2012 9:11:25 AM UTC-7, Ehsan Akhgari wrote:
> On 12-08-29 9:20 PM, Dave Mandelin wrote:
>
> > On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
>
> In my opinion, one of the reasons why Talos is disliked is because many
> people don't know where its code lives (hint:
> http://hg.mozilla.org/build/talos/) and can't run those tests like other
> test suites. I think this would be very valuable to fix, so that
> developers can read Talos tests like any other test, and fix or improve
> them where needed.
It is hard to find. And beyond that, it seems hard to use. It's been a
while since I've run Talos locally, but last time I did it was a pain to
set up and difficult to run, and I hear it's still kind of like that.

For testing tools, "convenient for the developer" is a critical
requirement, but one that has been neglected in the past. js/src/jit-test/
is an example of something that is very convenient for developers:
creating a test is just adding a .js file to a directory (no manifest or
extra files; by default an error or crash is a fail, but you can change
that per test), the harness is a Python file with nice options, the test
configuration and basic usage are documented in a README, and it lives in
the tree.

> >> [...] I believe
> >> that the bigger problem is that nobody owns watching over these numbers,
> >> and as a result we take regressions in some benchmarks which can
> >> actually be representative of what our users experience.
>
> > The interesting thing is that we basically have no idea if that's true
> > for any given Talos alarm.
>
> That's something that I think should be judged per benchmark. For
> example, the Ts measurements will probably correspond very directly to
> the startup time that our users experience. The Tp5 measurements don't
> directly correspond to anything like that, since nobody loads those
> pages sequentially, but it could be an indication of average page load
> performance.

I exaggerated a bit--yes, some tests like Ts are pretty easy to understand
and do correspond to user experience. With Tp5, I just don't know--I
haven't spent any time trying to use it or looking at regressions, since
JS doesn't affect it.

> >> I don't believe that the current situation is acceptable, especially
> >> with the recent focus on performance (through the Snappy project), and
> >> I would like to ask people if they have any ideas on what we can do to
> >> fix this. The fix might be turning off some Talos tests if they're
> >> really not useful, asking someone or a group of people to go over these
> >> test results, getting better tools to work with them, etc. But
> >> _something_ needs to happen here.
>
> > - Second, as you say, get an owner for performance regressions. There
> > are lots of ways we could do this. I think it would integrate fairly
> > easily into our existing processes if we (automatically or by a
> > designated person) filed a bug for each regression and marked it
> > tracking (so the release managers would own followup). Alternately, we
> > could have a designated person own followup. I'm not sure if that has
> > any advantages, but release managers would probably know. But doing any
> > of this is going to severely annoy engineers unless we get the false
> > positive rate under control.
>
> Note that some of the work to differentiate between false positives
> and real regressions needs to be done by the engineers, similar to the
> work required to investigate correctness problems. And people need to
> accept that seemingly benign changes may also cause real performance
> regressions, so it's not always possible to glance over a changeset and
> say "nah, this can't be my fault." :-)

Agreed.

> > - Speaking of false positives, we should seriously start tracking them.
> > We should keep track of each Talos regression found and its outcome.
> > (It would be great to track false negatives too but it's a lot harder
> > to catch them and record them accurately.)
> > That way we'd actually know whether we have a few false positives or a
> > lot, or whether the false positives were coming up on certain tests.
> > And we could use that information to improve the false positive rate
> > over time.
>
> I agree. Do you have any suggestions on how we would track them?

The details would vary according to the preferences of the person doing
it, but I'd sketch it out something like this: when Talos detects a
regression, file a bug to "resolve" it (i.e., show that it's not a real
regression, show that it's an acceptable regression for the patch, or fix
the regression). Then keep a file listing those bugs (with metadata for
each: tests regressed, date, component, etc.), and as each one is closed,
mark down the result: false positive, allowed, backed out, or fixed.
That's your data set. Of course, various parts of this could be automated,
but that's not required.

Dave
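
P.S. To make the tracking idea a bit more concrete, here is a rough sketch
of the kind of file and summary script I'm picturing. It's purely
illustrative -- the CSV name, column names, and outcome labels are all
made up here, and a spreadsheet or wiki page would do just as well:

  # track-regressions.py -- illustrative sketch only; file format, column
  # names, and outcome labels are invented for this example.
  import csv
  from collections import Counter, defaultdict

  # Hypothetical CSV columns: bug, date, component, tests, outcome
  # where "tests" is a space-separated list and "outcome" is one of:
  OUTCOMES = {"false-positive", "allowed", "backed-out", "fixed", "open"}

  def summarize(path):
      """Tally outcomes overall and per regressed test."""
      overall = Counter()
      per_test = defaultdict(Counter)
      with open(path, newline="") as f:
          for row in csv.DictReader(f):
              outcome = row["outcome"].strip()
              if outcome not in OUTCOMES:
                  raise ValueError("unknown outcome: %r" % outcome)
              overall[outcome] += 1
              for test in row["tests"].split():
                  per_test[test][outcome] += 1
      return overall, per_test

  if __name__ == "__main__":
      overall, per_test = summarize("talos-regressions.csv")
      closed = sum(n for o, n in overall.items() if o != "open")
      if closed:
          rate = 100.0 * overall["false-positive"] / closed
          print("false positives: %.1f%% of %d closed bugs" % (rate, closed))
      for test, counts in sorted(per_test.items()):
          print("%-8s %s" % (test, dict(counts)))

Run something like that over the file every so often and you get the
overall false positive rate plus a per-test breakdown, which is exactly
the data we'd want when deciding which Talos tests to trust, improve, or
turn off.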