Hello folks,
I’m one of the heavy users of XCTest.measureBlock as it exists in Xcode 7.2.
To give some hard numbers, I have ~50 performance tests in an OS X framework
project, occupying about 20 minutes of wall-clock time in total. The suite runs
on every commit.
The current implementation of measureBlock in closed-source Xcode is something
like this:
1. Run 10 trials
2. Compare the average across those 10 trials to some baseline
3. Compare the stdev across those 10 trials to some standard value (10% by
default)
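
The three steps above can be sketched roughly as follows. This is only my reading of the behavior, not Xcode's actual code: the `MeasureOutcome` type, the `evaluate` function, and the `maxRegression` parameter are made-up names for illustration, and how the average is really compared against the baseline is an assumption.

```swift
import Foundation

/// Hypothetical outcome type; not part of XCTest.
enum MeasureOutcome: Equatable {
    case pass
    case failedAverage
    case failedDeviation
}

/// Sketch of the fixed-size check: average vs. baseline, then relative
/// standard deviation vs. a threshold. All parameter names are assumptions.
func evaluate(trials: [Double],
              baseline: Double,
              maxRegression: Double = 0.10,
              maxRelativeStdDev: Double = 0.10) -> MeasureOutcome {
    precondition(!trials.isEmpty, "need at least one trial")
    let count = Double(trials.count)
    let mean = trials.reduce(0, +) / count
    // Population variance across the trials.
    let variance = trials.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / count
    let relativeStdDev = variance.squareRoot() / mean

    // Step 2: compare the average to the baseline.
    if mean > baseline * (1 + maxRegression) { return .failedAverage }
    // Step 3: compare the spread to the allowed value.
    if relativeStdDev > maxRelativeStdDev { return .failedDeviation }
    return .pass
}

// Ten quiet trials, versus nine quiet trials plus a single slow outlier.
let quietTrials = [Double](repeating: 1.0, count: 10)
let noisyTrials = [Double](repeating: 1.0, count: 9) + [5.0]
print(evaluate(trials: quietTrials, baseline: 1.0))
print(evaluate(trials: noisyTrials, baseline: 1.0))
```

With only 10 trials, a single slow trial is enough to move both the average and the standard deviation past their limits, which is the outlier problem described next.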
There are a lot of problems with this algorithm, but maybe the biggest one is
how it handles outliers. If you have a test suite running for 20 minutes,
chances are “something” is going to happen on the build server in that time:
a system background task, a software update, gremlins, etc.
So what happens lately is exactly *one* of the 10 * 50 = 500 total
trials takes a really long time, and it is a different failure each time
(i.e., it’s not my code, I swear). A result like this for some test is typical:
The probability that the whole suite passes shrinks exponentially with the
test suite size. If we assume that an individual trial only goes wrong due to
“chance” 0.1% of the time, then the overall test suite at N = 500 will only
pass about 60% of the time (0.999^500 ≈ 0.61). This is very vaguely consistent
with what I experience at my scale: a test suite that does not really tell me
if my code is broken or not.
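
To make the arithmetic concrete, here is a small sketch of how the pass rate falls off as the number of measurements grows. It assumes each measurement fails independently, and it takes a 0.1% chance-failure rate as an illustrative figure; `suitePassProbability` is a made-up helper name, not an XCTest API.

```swift
import Foundation

/// Probability that all `n` independent measurements pass when each one
/// fails by chance with probability `p`: (1 - p)^n.
func suitePassProbability(perMeasurementFailure p: Double, measurements n: Int) -> Double {
    return pow(1.0 - p, Double(n))
}

// Print the pass rate for a few suite sizes at an assumed 0.1% failure rate.
for n in [10, 100, 500, 1000] {
    let pass = suitePassProbability(perMeasurementFailure: 0.001, measurements: n)
    print(n, String(format: "%.1f%%", pass * 100))
}
```

At 500 measurements this comes out to roughly 60%, so a suite of this size fails by chance on a large fraction of runs even though each individual measurement is almost always fine.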
IMO the problem here is one of experiment design. From the data in the
screenshot, this very well might be a real performance regression that should
be properly investigated. It is only when I tell you a lot of extra
information (i.e., that this test will pass fine the next 100 executions and
it’s part of an enormous test suite where something is bound to fail) that a
failure due to random chance seems likely. In other words, running 10
iterations and pretending that will find performance regressions is a poor
approach.
I’ve done some prototyping on algorithms that use a dynamically sized number of
trials to find performance regressions. Apple employees, see rdar://21315474
for an algorithm for a sliding window for performance tests (that also has
other benefits, like measuring nanosecond-scale performance). I am certainly
willing to contribute that work in the open if there’s consensus it’s a good
direction.
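
To give a flavor of what a dynamically sized number of trials could look like, here is a deliberately simple sketch. It is not the algorithm from the radar; every name and threshold in it (`passesWithSlidingWindow`, `window`, `maxTrials`, `maxRegression`, `maxRelativeStdDev`) is hypothetical. The idea is to keep sampling and pass as soon as any window of consecutive trials is both quiet and within budget, so a lone outlier costs a few extra trials instead of a failure.

```swift
import Foundation

/// Hypothetical sketch: slide a fixed-size window over a growing list of
/// trials and pass as soon as one window is both quiet and fast enough.
/// Fails only after `maxTrials` samples without any acceptable window.
func passesWithSlidingWindow(sample: () -> Double,
                             baseline: Double,
                             window: Int = 10,
                             maxTrials: Int = 100,
                             maxRegression: Double = 0.10,
                             maxRelativeStdDev: Double = 0.10) -> Bool {
    precondition(window > 0 && maxTrials >= window)
    var trials: [Double] = []
    while trials.count < maxTrials {
        trials.append(sample())
        // Not enough data for a full window yet.
        guard trials.count >= window else { continue }
        let recent = trials.suffix(window)
        let mean = recent.reduce(0, +) / Double(window)
        let variance = recent.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / Double(window)
        let quiet = variance.squareRoot() / mean <= maxRelativeStdDev
        let fast = mean <= baseline * (1 + maxRegression)
        if quiet && fast { return true }
    }
    return false
}

// Simulated workload: every trial takes 1.0 except a single slow third trial.
var calls = 0
let passed = passesWithSlidingWindow(sample: {
    calls += 1
    return calls == 3 ? 5.0 : 1.0
}, baseline: 1.0)
print(passed, calls)
```

The trade-off in this sketch is that a genuinely slow test burns all `maxTrials` samples before it fails, so the cost moves from false failures to longer runs on real regressions.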
However, now that this is happening in the open, I’m interested in getting
others’ thoughts on this problem. Surely I am not the only serious user of
performance tests, and maybe people with better statistics backgrounds than I
have can suggest an appropriate solution.
Drew
_______________________________________________
swift-corelibs-dev mailing list
swift-corelibs-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-corelibs-dev