I'm sort of at a loss on this one... While working on some improvements to RegexKitLite, I've come across a bit of a problem. I'm in the middle of adding - (NSArray *)arrayOfStringsMatching:(NSString *)regex and friends, a method that (surprise!) returns an array of all the matches of a string for a given regex. Lots of people use the example regex enumerator included in the project, but hoisting arrayOfStrings like functionality in to the core makes things like
for(NSString *matchedString in [subjectString arrayOfStringsMatching:@"^(.*)$"]) { /* process results */ } a lot easier and cleaner, and potentially faster to boot. So, while testing the "faster" part with some simple bench marks, I came across a serious performance problem once GC was turned on. Without GC, here's what a simple benchmark consisting of a large (>1MB) file matched by the regex \S+ in a loop spits out: arrayOfStringsMatching: Time used: 1619821.0 us, per: 0.919759700420 us, count: 1761135 Profiling with shark shows the top three functions, after charging malloc/free and friends to their callers: 47.8% in uregex_findNext (this is obviously where the regex matching takes place) 22.0% in CFStringCreateWithSubstring 12.1% in _CFRelease The bulk of the remaining time is spent all over the place, with memcpy and bzero making up a good chunk of it. Performance counters show an average of 1.611 instructions retired per clock cycle, 79.101 clock cycles / 127.433 instructions retired between L1 misses, and 37,731.484 clock cycles / 60,785.422 instructions retired between L2 misses. Now, with GC turned on I get: arrayOfStringsMatching: Time used: 6247963.0 us, per: 3.547691119647 us, count: 1761135 Oookay, a 2.8x increase in the time it takes to do the same thing is sort of a problem. I could live with, but would be unhappy about, a 10% increase in the time it takes. A 280% increase in the time it takes is pretty deep in to "Not Acceptable" territory. Digging in with shark, the following turns up: First, two threads: 54.8% in main 45.2% in auto_collection_thread (!!!) Top 3 functions (for all threads) breaks down to, with no charging of anything to callers: 39.2% in auto_collection_thread 14.5% in uregex_findNext 11.1% in __spin_lock (!! eleven percent!? in spin_lock!?) Switching over to just the main thread, again, with no charging of anything to callers: 26.4% in uregex_findNext 19.0% in __spin_lock (!! WTF?) 16.3% in auto_zone_allocate_object Performance counters show an average of 1.395 instructions retired per clock cycle, 45.468 clock cycles / 63.429 instructions retired between L1 misses, and 23,857.762 clock cycles / 33,281.578 instructions retired between L2 misses. This amounts to a 13.5% reduction in instructions retired per clock cycle, a 50.2% decrease in the number of instructions retired between L1 misses, and a 45.2% decrease in the number of instructions retired between L2 misses. This is pretty horrific, GC is slaughtering the CPU caches. The basic overview of what happens internally is that uregex_findNext is called repeatedly until it returns with "nothing left to find". The results of the find are stored in to an NSRange *ranges array. The next step is to convert those ranges in to strings in bulk, the pointers of the strings are stored in (the equivalent of) "id *matchedStrings = NSAllocateCollectable(totalRanges * sizeof(id), NSScannedOption)"; Then, a simple for loop: for(int x = 0; x < totalRanges; x++) { matchedStrings[x] = [subjectString substringWithRange:ranges[x]]; } After some digging, nearly 12.5% of the total execution time is spent on "matchedStrings[x] =". That is, just in the assignment portion of that statement, the storing of the pointer returned by substringWithRange:. This is (obviously) due to -fobjc-gc turning the storing of a __strong pointer in to a call to objc_assign_strongCast(). Each and every call to objc_assign_strongCast, in turn, grabs a gc lock before it does its work. Soo.. what was a simple, single movl instruction of the returned pointer in to the matchedStrings array has now turned in to, at a minimum, a function call and an expensive mutex lock acquisition, a lock that is pretty heavily contested to boot, and that doesn't even include any of the work done inside the function itself. Turning the loop body statement in to "/* matchedStrings[x] = */ [subjectString substringWithRange:ranges[x]];" (creating the string, but not storing the pointer) turns in a time of: arrayOfStringsMatching: Time used: 3464158.0 us, per: 1.967003097434 us, count: 1761135 This is just shy of a 50% in reduction in the total execution time by commenting out just the assignment part of a single statement. For the record, completely commenting out that one single line (so that the string isn't created, nor is a pointer stored) turns in a time of 1133339.0 with GC turned on and 1065135.0 with GC turned off (gc is about 5% slower), so there's no doubt this single line is responsible for all of the slow down. And in case anyone was wondering, the existing RKLEnumerator supplied with the project turns in a time of 2242281.0us, so hoisting this functionality up in to the core is worth about a 27% improvement. This kind of reduction in performance is just flat out unacceptable, and I can't think of anything I can do to speed things up. Any ideas? _______________________________________________ Cocoa-dev mailing list (Cocoa-dev@lists.apple.com) Please do not post admin requests or moderator comments to the list. Contact the moderators at cocoa-dev-admins(at)lists.apple.com Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com This email sent to arch...@mail-archive.com