On 1/31/2013 12:05 PM, Dave Mandelin wrote:
On Thursday, January 31, 2013 9:17:44 AM UTC-8, Joshua Cranmer wrote:
For what it's worth, reading
<https://bugzilla.mozilla.org/show_bug.cgi?id=833890>, I do not get
the impression that dmandelin "proved" otherwise. His startup tests
have very low statistical confidence (n=2, n=3), and he himself
disclaims his own findings. It may be evidence that PGO is not a Ts
win, but it is weak evidence at best.
I could certainly run a larger number of trials to see what happens. In that case, I
stopped because the min values for warm startup were about equal (and also happened to be
about equal to other warm startup times I had measured recently). For many timed
benchmarks, "base value + positive random noise" seems like a good model, in
which case mins seem like good things to compare.
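To make the "base value + positive random noise" model concrete, here is a minimal Python sketch. It assumes exponentially distributed noise (the model above only says the noise is positive), and the base time and noise scale are hypothetical numbers chosen for illustration, not measurements from the bug.

```python
import random
import statistics


def simulate_timings(base_ms, noise_scale_ms, n, rng):
    """Draw n timings from the 'base value + positive random noise' model.

    Exponential noise is an assumption of this sketch, not something
    established in the thread.
    """
    return [base_ms + rng.expovariate(1.0 / noise_scale_ms) for _ in range(n)]


rng = random.Random(0)   # fixed seed so repeated runs give the same numbers
BASE_MS = 500.0          # hypothetical true warm-startup time
NOISE_MS = 40.0          # hypothetical mean of the added noise

for n in (3, 30):
    sample = simulate_timings(BASE_MS, NOISE_MS, n, rng)
    # Print sample size, min, and mean to compare the two summaries.
    print(n, round(min(sample), 1), round(statistics.mean(sample), 1))
```

Under these assumptions every sample is at least the base value, so the min can only overshoot it, and the overshoot shrinks as n grows (for exponential noise the expected overshoot is the noise scale divided by n). The mean stays offset by the average noise regardless of n. If some runs can come in below the nominal base, the model no longer holds and comparing mins is harder to justify.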
From a statistical hypothesis testing perspective, I think (I haven't
actually done the math) that the given data is unable to reject either
the hypothesis that PGO gives a benefit on startup time or the
hypothesis that it does not. Mostly, I was cringing at ehsan's statement
that your results "proved" the hypothesis. About what the best
statistical criteria are, I don't wish to argue here.
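One way to see why n=2 versus n=3 cannot settle the question is an exact permutation test, sketched below in Python. The choice of test is an assumption made for this illustration, not the math referred to above, and the timings are hypothetical, not the values from bug 833890.

```python
from itertools import combinations
from math import comb


def permutation_p_value(a, b):
    """One-sided exact permutation test: is mean(a) smaller than mean(b)?

    Returns the fraction of relabelings whose mean difference is at least
    as extreme (as small) as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = mean_diff(sum(a))
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        relabeled_sum = sum(pooled[i] for i in idx)
        # Small tolerance so the observed split always counts itself.
        if mean_diff(relabeled_sum) <= observed + 1e-12:
            count += 1
    return count / comb(len(pooled), n_a)


# Hypothetical warm-startup timings in ms, completely separated groups.
pgo = [480.0, 485.0]
non_pgo = [520.0, 530.0, 525.0]
p = permutation_p_value(pgo, non_pgo)
print(p)  # 0.1, i.e. 1 of the 10 possible relabelings
```

With 2 and 3 samples there are only 10 ways to relabel the runs, so no outcome can give a p-value below 0.1, and this test cannot reject "no difference" at the usual 0.05 level even when the two groups do not overlap at all. Failing to reject is not evidence of no effect either; it only shows that the sample is too small to decide.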
Our Talos results may be measuring imperfect things, but we have
enough datapoints that we can draw statistical conclusions from
them confidently.
Statistics doesn't help if you're measuring the wrong things. Whether Ts is
measuring the wrong thing, I don't know. It would be possible to learn
something about that question by measuring startup with a camera, Telemetry
simple measures, and Talos on the same machine and seeing how they compare.
I should clarify my previous statement: I want to avoid confirmation
bias in this decision. The proper way to do that is to lay out all the
criteria for acceptance or rejection before you run experiments and
measure the results. This, obviously, is impossible at this point, since
we have a mountain of data which has already biased our thought processes.
By the way, there is a project (in a very early phase now) to do accurate
measurements of startup time, both cold and warm, on machines that model user
hardware, etc.
This is really starting to get off-topic, but I do think we need clear
guidelines on evaluating performance results, which includes things like
ensuring proper statistical testing on results, etc.
dev-platform mailing list