Hi Drew,

Thanks for the detailed info on your issue. I see you filed a radar, and that
is indeed the best way to make sure an issue on Darwin platforms is addressed.
Unfortunately, our corelibs implementation of XCTest isn't yet ready for
performance testing.
- Tony

> On Dec 10, 2015, at 3:41 AM, Drew Crawford via swift-corelibs-dev
> <swift-corelibs-dev@swift.org> wrote:
>
> Hello folks,
>
> I'm one of the heavy users of XCTest.measureBlock as it exists in Xcode 7.2.
> To give some hard numbers: I have ~50 performance tests in an OS X framework
> project, occupying about 20m of wall-clock time in total, and they run on a
> per-commit basis.
>
> The implementation of measureBlock as it currently exists in closed-source
> Xcode is something like this:
>
> 1. Run 10 trials
> 2. Compare the average across those 10 trials to some baseline
> 3. Compare the stdev across those 10 trials to some standard value
>    (10% by default)
>
> (A code sketch of this check appears below the quote.)
>
> This algorithm has a lot of problems, but maybe the biggest one is how it
> handles outliers. If you have a test suite running for 20m, chances are
> "something" is going to happen on the build server in that time: a system
> background task, a software update, gremlins, etc.
>
> So what happens lately is that exactly *one* of the 10 * 50 = 500 total
> measured trials takes a really long time, and it is a different failure each
> time (i.e., it's not my code, I swear). A result like this for some test is
> typical:
>
> <Screen Shot 2015-12-10 at 5.12.13 AM.png>
>
> The probability of this kind of error compounds quickly with the size of the
> test suite. If we assume an individual trial fails due to "chance" only 0.1%
> of the time, then the overall test suite at N = 500 will pass only
> 0.999^500 ≈ 60% of the time (worked out below the quote). This is roughly
> consistent with what I experience at my scale: a test suite that does not
> really tell me whether my code is broken or not.
>
> IMO the problem here is one of experiment design. From the data in the
> screenshot alone, this could well be a real performance regression that
> should be properly investigated. It is only when I supply a lot of extra
> information (e.g. that this test will pass fine the next 100 executions, and
> that it is part of an enormous test suite where something is bound to fail)
> that a failure due to random chance seems likely. In other words, running 10
> iterations and pretending that will reliably find performance regressions is
> a poor approach.
>
> I've done some prototyping on algorithms that use a dynamically sized number
> of trials to find performance regressions; a rough sketch of the idea
> follows the quote. Apple employees, see rdar://21315474 for an algorithm
> using a sliding window for performance tests (which also has other benefits,
> like measuring nanosecond-scale performance). I am certainly willing to
> contribute that work in the open if there's consensus that it's a good
> direction.
>
> However, now that this is happening in the open, I'm interested in getting
> others' thoughts on this problem. Surely I am not the only serious user of
> performance tests, and maybe people with better statistics backgrounds than
> mine can suggest an appropriate solution.
>
> Drew
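Below is a minimal sketch, in modern Swift, of the 10-trial check described in
the quoted message. It is an illustration of the behavior Drew describes, not
the actual Xcode implementation; `baseline` and `maxStdevPercent` are
stand-ins for the per-test values Xcode stores.

    import Foundation

    struct MeasureResult {
        let trials: [TimeInterval]   // wall-clock time of each of the 10 runs

        var average: TimeInterval {
            trials.reduce(0, +) / Double(trials.count)
        }

        var standardDeviation: TimeInterval {
            let mean = average
            let variance = trials.map { ($0 - mean) * ($0 - mean) }
                                 .reduce(0, +) / Double(trials.count)
            return variance.squareRoot()
        }

        // Fails the test if the average regressed past the baseline, or if
        // run-to-run variation exceeds the allowance. One slow outlier
        // inflates both the average and the stdev, which is exactly the
        // failure mode described in the message.
        func passes(baseline: TimeInterval,
                    maxStdevPercent: Double = 10) -> Bool {
            average <= baseline
                && (standardDeviation / average) * 100 <= maxStdevPercent
        }
    }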
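The arithmetic behind the 60% figure, following the rough model in the message
(independent trials, a 0.1% spurious-failure rate per measured execution):

    import Foundation

    // If each of n measured executions independently fails by chance with
    // probability p, the whole suite passes with probability (1 - p)^n.
    let p = 0.001          // 0.1% spurious-failure rate per execution
    let n = 500.0          // 50 tests x 10 trials each
    print(pow(1 - p, n))   // ~0.606: the suite passes only ~60% of the time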
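And one possible shape for the "dynamically sized number of trials" idea,
sketched as a guess since the algorithm in rdar://21315474 is not public: keep
sampling until the estimate stabilizes or a trial budget is exhausted, and
report the median, which a single outlier cannot move. The names and
thresholds here are invented for illustration.

    import Foundation

    func adaptiveMeasure(minTrials: Int = 10,
                         maxTrials: Int = 200,
                         targetRelativeError: Double = 0.02,
                         block: () -> Void) -> TimeInterval {
        var samples: [TimeInterval] = []
        while samples.count < maxTrials {
            // Date() is coarse; a real harness would use a monotonic clock.
            let start = Date()
            block()
            samples.append(Date().timeIntervalSince(start))

            if samples.count >= minTrials {
                let median = samples.sorted()[samples.count / 2]
                let mean = samples.reduce(0, +) / Double(samples.count)
                let variance = samples.map { ($0 - mean) * ($0 - mean) }
                                      .reduce(0, +) / Double(samples.count - 1)
                let stderr = (variance / Double(samples.count)).squareRoot()
                // Stop once the mean is estimated precisely enough relative
                // to the (outlier-resistant) median.
                if stderr / median < targetRelativeError { return median }
            }
        }
        return samples.sorted()[samples.count / 2]
    }

A caller would then compare the returned median against a stored baseline,
with the variable trial count, rather than a fixed 10, absorbing noisy runs.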
_______________________________________________ swift-corelibs-dev mailing list swift-corelibs-dev@swift.org https://lists.swift.org/mailman/listinfo/swift-corelibs-dev