I'm sort of at a loss on this one...

While working on some improvements to RegexKitLite, I've come across a
bit of a problem.  I'm in the middle of adding - (NSArray
*)arrayOfStringsMatching:(NSString *)regex and friends, a method that
(surprise!) returns an array of all the matches of a string for a
given regex.  Lots of people use the example regex enumerator included
in the project, but hoisting arrayOfStrings like functionality in to
the core makes things like

for(NSString *matchedString in [subjectString
arrayOfStringsMatching:@"^(.*)$"]) { /* process results */ }

a lot easier and cleaner, and potentially faster to boot.

So, while testing the "faster" part with some simple bench marks, I
came across a serious performance problem once GC was turned on.

Without GC, here's what a simple benchmark consisting of a large
(>1MB) file matched by the regex \S+ in a loop spits out:

arrayOfStringsMatching: Time used: 1619821.0 us, per: 0.919759700420
us, count: 1761135

Profiling with shark shows the top three functions, after charging
malloc/free and friends to their callers:

47.8% in uregex_findNext (this is obviously where the regex matching
takes place)
22.0% in CFStringCreateWithSubstring
12.1% in _CFRelease

The bulk of the remaining time is spent all over the place, with
memcpy and bzero making up a good chunk of it.

Performance counters show an average of 1.611 instructions retired per
clock cycle, 79.101 clock cycles / 127.433 instructions retired
between L1 misses, and 37,731.484 clock cycles / 60,785.422
instructions retired between L2 misses.

Now, with GC turned on I get:

arrayOfStringsMatching: Time used: 6247963.0 us, per: 3.547691119647
us, count: 1761135

Oookay, a 2.8x increase in the time it takes to do the same thing is
sort of a problem.  I could live with, but would be unhappy about, a
10% increase in the time it takes.  A 280% increase in the time it
takes is pretty deep in to "Not Acceptable" territory. Digging in with
shark, the following turns up:

First, two threads:

54.8% in main
45.2% in auto_collection_thread (!!!)

Top 3 functions (for all threads) breaks down to, with no charging of
anything to callers:

39.2% in auto_collection_thread
14.5% in uregex_findNext
11.1% in __spin_lock (!! eleven percent!? in spin_lock!?)

Switching over to just the main thread, again, with no charging of
anything to callers:

26.4% in uregex_findNext
19.0% in __spin_lock (!! WTF?)
16.3% in auto_zone_allocate_object

Performance counters show an average of 1.395 instructions retired per
clock cycle, 45.468 clock cycles / 63.429 instructions retired between
L1 misses, and 23,857.762 clock cycles / 33,281.578 instructions
retired between L2 misses.

This amounts to a 13.5% reduction in instructions retired per clock
cycle, a 50.2% decrease in the number of instructions retired between
L1 misses, and a 45.2% decrease in the number of instructions retired
between L2 misses.  This is pretty horrific, GC is slaughtering the
CPU caches.

The basic overview of what happens internally is that uregex_findNext
is called repeatedly until it returns with "nothing left to find".
The results of the find are stored in to an NSRange *ranges array.

The next step is to convert those ranges in to strings in bulk, the
pointers of the strings are stored in (the equivalent of) "id
*matchedStrings = NSAllocateCollectable(totalRanges * sizeof(id),
NSScannedOption)";

Then, a simple for loop:

for(int x = 0; x < totalRanges; x++) {
  matchedStrings[x] = [subjectString substringWithRange:ranges[x]];
}

After some digging, nearly 12.5% of the total execution time is spent
on "matchedStrings[x] =".  That is, just in the assignment portion of
that statement, the storing of the pointer returned by
substringWithRange:.

This is (obviously) due to -fobjc-gc turning the storing of a __strong
pointer in to a call to objc_assign_strongCast().  Each and every call
to objc_assign_strongCast, in turn, grabs a gc lock before it does its
work.  Soo.. what was a simple, single movl instruction of the
returned pointer in to the matchedStrings array has now turned in to,
at a minimum, a function call and an expensive mutex lock acquisition,
a lock that is pretty heavily contested to boot, and that doesn't even
include any of the work done inside the function itself.

Turning the loop body statement in to "/* matchedStrings[x] = */
[subjectString substringWithRange:ranges[x]];" (creating the string,
but not storing the pointer) turns in a time of:

arrayOfStringsMatching: Time used: 3464158.0 us, per: 1.967003097434
us, count: 1761135

This is just shy of a 50% in reduction in the total execution time by
commenting out just the assignment part of a single statement.

For the record, completely commenting out that one single line (so
that the string isn't created, nor is a pointer stored) turns in a
time of 1133339.0 with GC turned on and 1065135.0 with GC turned off
(gc is about 5% slower), so there's no doubt this single line is
responsible for all of the slow down.  And in case anyone was
wondering, the existing RKLEnumerator supplied with the project turns
in a time of 2242281.0us, so hoisting this functionality up in to the
core is worth about a 27% improvement.

This kind of reduction in performance is just flat out unacceptable,
and I can't think of anything I can do to speed things up.  Any ideas?
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Reply via email to