Hello folks,

I’m a heavy user of XCTest.measureBlock as it exists in Xcode 7.2.  To give 
some hard numbers: I have ~50 performance tests in an OS X framework project, 
taking about 20 minutes of wall-clock time in total, and they run on every 
commit.

The implementation of measureBlock in closed-source Xcode is something like 
this (sketched in code after the list):

1.  Run 10 trials
2.  Compare the average across those 10 trials to some baseline
3.  Compare the relative standard deviation across those 10 trials to an 
allowed maximum (10% by default)
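
For reference, here is roughly what that looks like in code.  To be clear, 
this is only a sketch of the behavior described in the list above, written 
against closed-source behavior; the function name, the exact pass/fail rules, 
and any baseline slack are my guesses, not the real implementation:

import Foundation

// A rough sketch of the measureBlock algorithm described above.
// This is a guess at closed-source behavior, not the real code.
func measureBlockSketch(baseline: TimeInterval,
                        maxRelativeStandardDeviation: Double = 0.10,
                        block: () -> Void) -> Bool {
    var samples: [TimeInterval] = []
    for _ in 0..<10 {                                  // 1. run 10 trials
        let start = Date()
        block()
        samples.append(Date().timeIntervalSince(start))
    }
    let mean = samples.reduce(0, +) / Double(samples.count)
    let variance = samples.map { ($0 - mean) * ($0 - mean) }
                          .reduce(0, +) / Double(samples.count)
    let relativeStdDev = sqrt(variance) / mean
    return mean <= baseline                            // 2. average vs. baseline
        && relativeStdDev <= maxRelativeStandardDeviation // 3. stdev vs. 10%
}

Note that a single slow trial inflates both the mean and the relative standard 
deviation, so one outlier can fail either check.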

This algorithm has a lot of problems, but maybe the biggest is how it handles 
outliers.  If a test suite runs for 20 minutes, chances are “something” is 
going to happen on the build server in that window: a system background task, 
a software update, gremlins, etc.

So what happens lately is that exactly *one* of the 10 × 50 = 500 total trials 
takes a really long time, and it is a different test each time (i.e., it’s not 
my code, I swear).  A result like this for some test is typical:

[screenshot of a typical failing test result omitted]
The probability of this kind of spurious failure compounds with test suite 
size.  If we assume an individual measureBlock fails due to “chance” only 0.1% 
of the time, then a suite of N = 500 measurements passes only 
(1 − 0.001)^500 ≈ 61% of the time.  This is roughly consistent with what I 
experience at my scale, i.e. a test suite that does not really tell me whether 
my code is broken or not.
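
The arithmetic, for anyone who wants to plug in their own numbers (this 
assumes the chance failures are independent):

import Foundation

// Probability that a suite of n independent measurements passes when each
// one fails purely by chance with probability p: (1 - p)^n.
func suitePassProbability(n: Int, chanceFailureRate p: Double) -> Double {
    return pow(1 - p, Double(n))
}

print(suitePassProbability(n: 500, chanceFailureRate: 0.001)) // ≈ 0.606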

IMO the problem here is one of experiment design.  Taken in isolation, the 
data in the screenshot might well be a real performance regression that should 
be properly investigated.  It is only when I tell you a lot of extra 
information (that this test will pass fine for the next 100 executions, and 
that it is part of an enormous test suite where something is bound to fail) 
that a failure due to random chance becomes the likely explanation.  In other 
words, running 10 iterations and pretending that will find performance 
regressions is a poor approach.

I’ve done some prototyping on algorithms that use a dynamically sized number 
of trials to find performance regressions.  Apple employees, see 
rdar://21315474 for a sliding-window algorithm for performance tests (one that 
also has other benefits, like measuring nanosecond-scale performance).  I am 
certainly willing to contribute that work in the open if there is consensus 
that it’s a good direction.
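
To make the idea concrete, here is a strawman of a dynamically sized trial 
count.  This is not the rdar algorithm, just the simplest stopping rule I can 
think of (sample until the standard error of the mean is small relative to 
the mean); the names, thresholds, and trial budget are made up:

import Foundation

// Strawman adaptive measurement loop (NOT the rdar://21315474 algorithm):
// keep sampling until the standard error of the mean falls below a target
// fraction of the mean, or the trial budget runs out. A single outlier
// among many samples barely moves the standard error, so it cannot fail
// the test by itself.
func adaptiveMeasure(maxTrials: Int = 1000,
                     targetRelativeStandardError: Double = 0.01,
                     block: () -> Void) -> (mean: TimeInterval, trials: Int) {
    var samples: [TimeInterval] = []
    while samples.count < maxTrials {
        let start = Date()
        block()
        samples.append(Date().timeIntervalSince(start))
        guard samples.count >= 10 else { continue }    // minimum sample size
        let n = Double(samples.count)
        let mean = samples.reduce(0, +) / n
        let variance = samples.map { ($0 - mean) * ($0 - mean) }
                              .reduce(0, +) / (n - 1)  // sample variance
        let standardError = sqrt(variance / n)
        if standardError / mean <= targetRelativeStandardError { break }
    }
    let mean = samples.reduce(0, +) / Double(samples.count)
    return (mean, samples.count)
}

A quiet block converges after a handful of trials; a noisy one keeps sampling 
until the noise averages out, instead of failing the whole suite.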

However, now that this is happening in the open, I’m interested in getting 
others’ thoughts on this problem.  Surely I am not the only serious user of 
performance tests, and maybe people with better statistics backgrounds than I 
have can suggest an appropriate solution.

Drew
