On 12-08-29 9:20 PM, Dave Mandelin wrote:
On Wednesday, August 29, 2012 4:03:24 PM UTC-7, Ehsan Akhgari wrote:
Hi everyone,

Part of how the current situation comes about is that many developers
ignore the Talos regression emails that go to dev-tree-management,

Talos is widely disliked and distrusted by developers, because it's hard to 
understand what it's really measuring, and there are lots of false alarms. 
Metrics and A-Team have been doing a ton of work to improve this. In 
particular, I told them that some existing Talos JS tests were not useful to 
us, and they deleted them. And v2 is going to have exactly the tests we want, 
with regression alarms. So Talos can (and will) be fixed for developers.

In my opinion, one of the reasons Talos is disliked is that many people don't know where its code lives (hint: http://hg.mozilla.org/build/talos/) and can't run those tests the way they run other test suites. I think this would be very valuable to fix, so that developers can read Talos tests like any other test, and fix or improve them where needed.

Some people have noted in the past that some Talos measurements are not
representative of something that the users would see, the Talos numbers
are noisy, and we don't have good tools to deal with these types of
regressions.  There might be some truth to all of these, but I believe
that the bigger problem is that nobody owns watching over these numbers,
and as a result we take regressions in some benchmarks which can
actually be representative of what our users experience.

The interesting thing is that we basically have no idea if that's true for any 
given Talos alarm.

That's something that I think should be judged per benchmark. For example, the Ts measurements will probably correspond very directly to the startup time that our users experience. The Tp5 measurements don't directly correspond to anything like that, since nobody loads those pages sequentially, but they could be an indication of average page load performance.

I don't believe that the current situation is acceptable, especially
with the recent focus on performance (through the Snappy project), and I
would like to ask people if they have any ideas on what we can do to fix
this.  The fix might be turning off some Talos tests if they're really
not useful, asking someone or a group of people to go over these test
results, building better tools to deal with them, etc.  But _something_ needs to
happen here.

I would say:

- First, and most important, fix the test suite so that it measures only things 
that are useful and meaningful to developers and users. We can easily take a 
first cut at this if engineering teams go over the tests related to their work, 
and tell A-Team which are not useful. Over time, I think we need to get a solid 
understanding of what performance looks like to users, what things to test, and 
how to test them soundly. This may require dedicated performance engineers or a 
performance product manager.

Absolutely. I think developers need to be more proactive about addressing this. I've heard so many times that measurement X is useless. I think it's time for us to even consider stopping some of the Talos tests if we don't think they're useful. At the very least, we could use the machine time to run other tests!

- Second, as you say, get an owner for performance regressions. There are lots 
of ways we could do this. I think it would integrate fairly easily into our 
existing processes if we (automatically or by a designated person) filed a bug 
for each regression and marked it tracking (so the release managers would own 
followup). Alternately, we could have a designated person own followup. I'm not 
sure if that has any advantages, but release managers would probably know. But 
doing any of this is going to severely annoy engineers unless we get the false 
positive rate under control.

Note that some of the work to differentiate between false positives and real regressions needs to be done by the engineers, similar to the work required to investigate correctness problems. And people need to accept that seemingly benign changes may also cause real performance regressions, so it's not always possible to glance over a changeset and say "nah, this can't be my fault." :-)

- Speaking of false positives, we should seriously start tracking them. We 
should keep track of each Talos regression found and its outcome. (It would be 
great to track false negatives too but it's a lot harder to catch them and 
record them accurately.) That way we'd actually know whether we have a few 
false positives or a lot, or whether the false positives were coming up on 
certain tests. And we could use that information to improve the false positive 
rate over time.

I agree.  Do you have any suggestions on how we would track them?
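
Even something lightweight might be enough to get started. Purely as an illustration (none of this exists today; the file name, CSV fields, and outcome labels are assumptions made up for the example), a tiny script that logs each alert together with its eventual triage outcome and then summarizes the false-positive rate per test could look something like this:

# Purely illustrative sketch -- not an existing tool.  The file name,
# CSV fields, and outcome labels ("real" / "false-positive" / "unknown")
# are assumptions made up for this example.
import csv
from collections import Counter, defaultdict

LOG = "talos_alerts.csv"   # one row per dev-tree-management alert

def record_alert(date, test, platform, changesets, outcome, bug=""):
    # outcome is one of "real", "false-positive", "unknown"
    with open(LOG, "a") as f:
        csv.writer(f).writerow([date, test, platform, changesets, outcome, bug])

def summarize():
    # Print the false-positive rate for every test that has resolved alerts.
    per_test = defaultdict(Counter)
    with open(LOG) as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            # row = [date, test, platform, changesets, outcome, bug]
            per_test[row[1]][row[4]] += 1
    for test, counts in sorted(per_test.items()):
        resolved = counts["real"] + counts["false-positive"]
        if resolved:
            rate = 100.0 * counts["false-positive"] / resolved
            print("%-6s %3d alerts, %.0f%% false positives"
                  % (test, sum(counts.values()), rate))

if __name__ == "__main__":
    # Made-up example data:
    record_alert("2012-08-29", "Tp5", "win7", "abc123:def456", "false-positive")
    record_alert("2012-08-29", "Ts", "linux64", "abc123:def456", "real")
    summarize()

Even a shared spreadsheet with the same columns would do; the point is simply that every alert gets a recorded outcome so we can compute the rates later.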

Thanks!
Ehsan

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
