Hi Drew,

Thanks for the detailed info on your issue. I see you filed a radar, and that 
is indeed the best way to make sure an issue on Darwin platforms is addressed. 
Unfortunately our corelibs implementation of XCTest isn’t ready yet for 
performance testing.

- Tony

> On Dec 10, 2015, at 3:41 AM, Drew Crawford via swift-corelibs-dev 
> <swift-corelibs-dev@swift.org> wrote:
> 
> Hello folks,
> 
> I’m one of the heavy users of XCTest.measureBlock as it exists in Xcode 7.2.  
> To give some hard numbers: I have ~50 performance tests in an OS X framework 
> project, occupying about 20 minutes of wall-clock time in total, and they run 
> on every commit.
> 
> The implementation of measureBlock as it currently exists in closed-source 
> Xcode is something like this:
> 
> 1.  Run 10 trials
> 2.  Compare the average across those 10 trials to some baseline
> 3.  Compare the stdev across those 10 trials to some standard value (10% by 
> default)
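The three steps above can be sketched roughly as follows. To be clear, this is my own reconstruction from observed behavior, not Apple's actual code; the function name, the baseline comparison, and the thresholds are all guesses.

```swift
import Foundation

// Hypothetical sketch of the closed-source measureBlock check.
// All names and thresholds here are assumptions, not Apple's source.
func evaluateMeasurement(times: [Double],
                         baseline: Double,
                         maxRelativeStdDev: Double = 0.10) -> Bool {
    let n = Double(times.count)                        // nominally 10 trials
    let mean = times.reduce(0, +) / n
    let variance = times.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / n
    let relativeStdDev = sqrt(variance) / mean
    // Pass only if the average is at or under the baseline and the
    // run-to-run variation stays under the default 10%.
    return mean <= baseline && relativeStdDev <= maxRelativeStdDev
}
```

Note that a single slow outlier trial inflates both the mean and the standard deviation, so one bad sample can fail the whole measurement — which is exactly the failure mode described below.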
> 
> There are really a lot of problems with this algorithm, but maybe the biggest 
> one is how it handles outliers.  If you have a test suite running for 20m, 
> chances are “something” is going to happen on the build server in that time.  
> System background task, software update, gremlins etc.
> 
> So what happens lately is that exactly *one* of the 10 * 50 = 500 total 
> measureBlocks takes a really long time, and it is a different failure each 
> time (i.e., it’s not my code, I swear).  A result like this for some test is 
> typical:
> 
> <Screen Shot 2015-12-10 at 5.12.13 AM.png>
> 
> 
> The probability of this kind of error compounds quickly with the test suite 
> size.  If we assume an individual measureBlock fails due to “chance” only 
> 0.1% of the time, then the overall test suite at N = 500 will only pass 
> 0.999^500 ≈ 60% of the time.  This is roughly consistent with what I 
> experience at my scale—i.e. a test suite that does not really tell me whether 
> my code is broken or not.
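The arithmetic behind that claim is a straightforward independence argument: if each block spuriously fails with probability p, a suite of N independent blocks passes with probability (1 - p)^N. The 0.1% rate is an assumed figure for illustration.

```swift
import Foundation

// If a single measureBlock spuriously fails with probability p,
// a suite of n independent blocks passes with probability (1 - p)^n.
let p = 0.001   // assumed 0.1% spurious-failure rate per block
let n = 500.0   // 50 tests x 10 trials = 500 total measureBlocks
let suitePassProbability = pow(1.0 - p, n)   // ≈ 0.61
print(suitePassProbability)
```

In other words, even an individually very reliable check becomes unreliable at suite scale, and halving the per-block flake rate only gets the suite to about 78%.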
> 
> IMO the problem here is one of experiment design.  From the data in the 
> screenshot, this very well might be a real performance regression that should 
> be properly investigated.  It is only when I tell you a lot of extra 
> information—e.g. that this test will pass fine the next 100 executions and 
> it’s part of an enormous test suite where something is bound to fail—that a 
> failure due to random chance seems likely.  In other words, running 10 
> iterations and pretending that will find performance regressions is a poor 
> approach.
> 
> I’ve done some prototyping on algorithms that use a dynamically sized number 
> of trials to find performance regressions.  Apple employees, see 
> rdar://21315474 <rdar://21315474> for an algorithm for a sliding window for 
> performance tests (that also has other benefits, like measuring 
> nanosecond-scale performance).  I am certainly willing to contribute that 
> work in the open if there’s consensus it’s a good direction.
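One possible shape for a dynamically sized trial count is to keep sampling until a recent window of measurements stabilizes, rather than always stopping at 10. The sketch below is purely illustrative — the actual algorithm in the radar may differ, and every name and threshold here is an assumption.

```swift
import Foundation

// Illustrative only: run trials until the relative standard deviation of
// the most recent `window` samples drops below `tolerance`, capped at
// `maxTrials`. Outlier-heavy runs simply take more samples to converge
// instead of failing outright.
func measureUntilStable(window: Int = 10,
                        tolerance: Double = 0.02,
                        maxTrials: Int = 1000,
                        block: () -> Double) -> Double {
    var samples: [Double] = []
    while samples.count < maxTrials {
        samples.append(block())
        guard samples.count >= window else { continue }
        let recent = samples.suffix(window)
        let mean = recent.reduce(0, +) / Double(window)
        let variance = recent.map { ($0 - mean) * ($0 - mean) }
                             .reduce(0, +) / Double(window)
        if sqrt(variance) / mean <= tolerance {
            return mean   // converged: report the stable window's mean
        }
    }
    return samples.reduce(0, +) / Double(samples.count)
}
```

A scheme like this trades a fixed trial count for a stopping rule, so a one-off hiccup on the build server costs a few extra iterations rather than a spurious red build.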
> 
> However, now that this is happening in the open, I’m interested in getting 
> others’ thoughts on this problem.  Surely I am not the only serious user of 
> performance tests, and maybe people with better statistics backgrounds than I 
> have can suggest an appropriate solution.
> 
> Drew
> 
> _______________________________________________
> swift-corelibs-dev mailing list
> swift-corelibs-dev@swift.org
> https://lists.swift.org/mailman/listinfo/swift-corelibs-dev
