[ANN]: RegexKitLite 4.0

John Engelhart Tue, 20 Apr 2010 12:46:56 -0700

RegexKitLite 4.0 has been released.  Links:

Download: http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.tar.bz2
(139.1K)
Documentation: http://regexkit.sourceforge.net/RegexKitLite/index.html
PDF Documentation:
http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.pdf (1.1M)


On a personal note, please remember that RegexKitLite is open source
software distributed under the BSD License.  This means that you are
required to acknowledge your use of RegexKitLite in your application.

There are an awful lot of "Top 10" applications that use RegexKitLite
that don't acknowledge their use of RegexKitLite (or, at least, no
acknowledgement can be easily found, even when one makes an effort to
find it).  At this point, I really don't have a problem starting a
"Public Shaming" list of applications and companies that are
non-compliant with the licensing terms.  It's free.  It costs you
nothing.  Many people have asked for an "exception" to strict
compliance with the BSD License (to include not the complete text of
the BSD license but just a simple, one-line "acknowledgement" in their
app), which I have granted.  I doubt many people would have a whole lot
of sympathy for a company that has just hit the jackpot by being
bought by a large "social network" company because of its iPhone app,
only to have the app yanked from the App Store due to licensing
non-compliance over free, open-source software, whose only "cost" is a
simple acknowledgement.

RegexKitLite 4.0 is a major release that includes new features, new
APIs, and bug fixes.  Highlights of what's new in the 4.0 release:

PDF Documentation

Considerable effort was put into creating high quality PDF
documentation.  The style of the documentation is very similar to the
Apple ADC produced documentation that you are already familiar with.
Overall, I'm very happy with the final results.

Improved Performance

Previously, the regular expression cache worked on a "1-way set
associative" cache, essentially '[regexString hash] % cacheSlots'.
RegexKitLite 4.0 uses a 4-way set associative cache with a genuine
least recently used replacement policy.  This means that strings whose
hashes are congruent modulo the number of cache slots (e.g., with 7
slots, 94 % 7 == 3 and 122 % 7 == 3) are part of the same set
and can now be placed in one of four "ways" in that set.  When a set
is full, the way in that set that was the least recently used is the
one that is picked to be ejected from the cache and filled with the
new entry.  This change /dramatically/ increases the odds that a
previously used regular expression will be found in the compiled
regular expression cache.  The LRU algorithm chosen is also
particularly clever: updating a "way" so that it is the most recently
used is completely branchless and compiles to about a dozen
instructions.  This means that on a modern out-of-order superscalar
CPU, such as the Intel Core 2, true LRU tracking is probably "free",
or very close to it, where "free" means there are no cycles used
exclusively to perform LRU work.
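
As a rough illustration of the cache organization described above, here
is a minimal C sketch of a 4-way set associative cache with true LRU
replacement.  It is not RegexKitLite's implementation: the type and
function names, the 7-set size (borrowed from the "% 7" example), and
the per-set timestamp scheme are all assumptions made for the example,
and this version uses ordinary branches rather than the branchless
update the release uses.  Counter wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SETS 7u   /* hypothetical set count, matching the "% 7" example */
#define CACHE_WAYS 4u   /* four ways per set */

typedef struct {
    uint32_t hash;      /* hash of the cached regex string (valid entries only) */
    int      valid;     /* 0 while the way is still empty */
    uint32_t last_used; /* value of the set's clock when last touched */
} CacheWay;

typedef struct {
    CacheWay ways[CACHE_WAYS];
    uint32_t clock;     /* per-set counter, bumped on every access */
} CacheSet;

typedef struct {
    CacheSet sets[CACHE_SETS];
} Cache;

/* Returns 1 on a hit, 0 on a miss.  On a miss the hash is inserted,
   evicting the least recently used way if the set is full. */
static int cache_access(Cache *cache, uint32_t hash) {
    assert(cache != NULL);
    CacheSet *set = &cache->sets[hash % CACHE_SETS];
    size_t victim = 0;
    set->clock++;

    /* Hit: mark the matching way as most recently used. */
    for (size_t i = 0; i < CACHE_WAYS; i++) {
        CacheWay *way = &set->ways[i];
        if (way->valid && way->hash == hash) {
            way->last_used = set->clock;
            return 1;
        }
    }

    /* Miss: take the first empty way, else the oldest timestamp. */
    for (size_t i = 0; i < CACHE_WAYS; i++) {
        CacheWay *way = &set->ways[i];
        if (!way->valid) { victim = i; break; }
        if (way->last_used < set->ways[victim].last_used) { victim = i; }
    }
    set->ways[victim] = (CacheWay){ .hash = hash, .valid = 1, .last_used = set->clock };
    return 0;
}
```

With a zero-initialized Cache, accessing hashes 94 and 122 alternately
misses once each and then hits every time, whereas a 1-way cache would
evict one to make room for the other on every access.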

As to why it is important to have a cache of compiled regular
expressions, benchmarking showed that compiling a regular expression
took ~27,560 ns (on average, based on ~25 different regular
expressions), where retrieving a cached compiled regular expression
took ~51 ns.  The time to retrieve a cached compiled regular
expression is constant, regardless of the regular expression's
complexity or the time it takes to actually compile the regular
expression.  This means that it is ~541.3 times faster to retrieve a
cached compiled regular expression than it is to actually compile it.
In terms of operations per second, this means it is possible to
perform 19,680,762 cache look ups per second, compared to 36,285
regular expression compiles per second.  Timing done on a MacBook Pro
@ 2.66GHz Core 2 on 10.6.2.
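
The relationship between the per-operation times and the per-second
rates above can be checked with a couple of one-line helpers.  This is
only a sketch, and the function names are made up for the example.
Plugging in the rounded figures (~27,560 ns and ~51 ns) gives roughly
36,284 compiles per second, 19.6 million look ups per second, and a
ratio of about 540.  The slightly different numbers quoted above
presumably come from the unrounded measurements, though that is an
assumption.

```c
#include <assert.h>

/* Operations per second for a given per-operation time in nanoseconds. */
static double ops_per_second(double ns_per_op) {
    assert(ns_per_op > 0.0);
    return 1e9 / ns_per_op;
}

/* How many times faster a cache look up is than a compile. */
static double speedup(double compile_ns, double lookup_ns) {
    assert(lookup_ns > 0.0);
    return compile_ns / lookup_ns;
}
```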

The ICU regular expression library also requires that the text to be
searched be encoded in UTF-16 format.  How an NSString (or CFString)
chooses to encode the contents of the string is an "implementation
issue", and normally not something a programmer should worry about.
This is one of those times where it can make a difference.  The short
version is this: Sometimes a Cocoa / Foundation / CoreFoundation
string object uses UTF-16 internally, in which case RegexKitLite
takes advantage of that fact and uses that buffer directly, an
obvious performance win.  On the other hand, sometimes it doesn't, so
RegexKitLite needs to convert the string's contents into a UTF-16
encoded buffer.  RegexKitLite also caches these conversions, and
RegexKitLite 4.0 now uses the same LRU algorithm to manage these
conversions as well.

"Officially Supported" In iPhone OS >= 3.2

The iPhone OS 3.2 SDK introduced official support for ICU based
regular expressions, and for linking to the libicucore ICU dynamic
library (-licucore) for the purpose of using its regular expression
engine.  Many developers were concerned as to whether or not the use
of RegexKitLite represented a violation of the iPhone SDK Agreement
prohibiting the use of private or undocumented APIs.  As of iPhone OS
3.2, this is no longer a concern.  See the RegexKitLite documentation
for more information and links to Apple's documentation about this
change.

While iPhone OS 3.2 introduced official support for ICU based regular
expressions, at this time it is unclear if the iPhone 4.0 SDK
Agreement "3.3.1" clause change prohibits the use of RegexKitLite in
iPhone OS 4.0.  Technically, RegexKitLite is a "compatibility layer"
between iPhoneOS/Cocoa/Foundation/NSString and the (now) "Documented
APIs" of ICU regular expressions.

---- begin disclaimer ----
It is not my intention to start a religious war on the merits of the
3.3.1 change, or whether or not the change to 3.3.1 was intended to
cover things like RegexKitLite.

Anyone who believes that 3.3.1 does not cover RegexKitLite is simply
kidding themselves.  I personally have no doubt whatsoever that the
wording of 3.3.1 covers RegexKitLite.  The wording of the clause does
not make allowances for the fact that something like RegexKitLite is
supplied to you in "source code format" that you compile, or that the
source code is written in Objective-C.  The wording /is/ explicit that
whatever is being considered need only act as a "compatibility layer"
between the "documented APIs", which is exactly what RegexKitLite
does. QED.  Note that this is a different interpretation of 3.3.1 than
that taken by many other open source projects, which take the position
that because the source code, written in one of the approved
languages, is supplied, the 3.3.1 clause does not apply, or which rely
on some project-biased interpretation of the definition of
"compatibility layer".

My intent is to be brutally honest with those developers who need to
decide whether the risk of using RegexKitLite, and the potential
consequence of it being grounds for rejection, outweighs the gains of
using it and the cost of having to rewrite / re-architect their
application without RegexKitLite.  This is done under the "hope for
the best, plan for the worst" philosophy: the worst case scenario is
that 3.3.1 covers RegexKitLite, and you should plan accordingly.
Caveat Emptor.

Currently I have no hard, empirical evidence as to whether or not
3.3.1 applies to RegexKitLite to help you, the developer, make a
decision one way or the other.  I would appreciate reports on whether
or not an application that uses RegexKitLite was accepted or rejected
for any version of iPhone OS.  This information will be incorporated
into the documentation of later versions to help guide the decisions
of future developers.  As it currently stands, I am not aware of a
single iPhone application that has been rejected due to the use of
RegexKitLite, but there are an awful lot of iPhone applications that
have been accepted, and this was prior to iPhone OS 3.2.
---- end disclaimer ----

New Features

RegexKitLite 4.0 now supports the new Blocks feature through a number
of new methods (these represent the shorter "convenience" form of the
methods):

enumerateStringsMatchedByRegex:usingBlock:
enumerateStringsSeparatedByRegex:usingBlock:
stringByReplacingOccurrencesOfRegex:usingBlock:
replaceOccurrencesOfRegex:usingBlock: (NSMutableString)

The enumerateStringsMatchedByRegex: method invokes the passed ^block
argument for each match, whereas enumerateStringsSeparatedByRegex:
uses the supplied regular expression to perform a "split" operation on
the string and invokes the supplied ^block argument with split
results.

The stringByReplacingOccurrencesOfRegex: and
replaceOccurrencesOfRegex: methods perform a "search and replace"
operation on the string, invoking the supplied ^block argument with
details of each match, and replacing the characters matched with the
contents of the NSString returned by the ^block.  This is very similar
to the previous "search and replace" functionality that allowed you to
create a replacement string using the contents of various capture
groups using the "$n" syntax (e.g., @"First: $2, Last: $1", where $1
and $2 are capture groups one and two from the regular expression).
The difference is that the Blocks-based approach allows complete
control over the replacement text instead of the limited "fixed
function" capability previously available.

The following is an example of what's possible using the new
Blocks-based replacement functionality.  A common problem when working
with HTML text is dealing with "&#NNN;" and "&#xHHH;" encoded
characters.  Using Blocks-based search and replace, we can match these
sequences using a regular expression and replace them with the actual
Unicode character that they represent:

NSString *string = @"A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)\n"
                   @"Even >0xffff are handled: &#119808; or &#x1D400; (0x1d400 == MATHEMATICAL BOLD CAPITAL A)";
NSString *regex = @"&#(?:([0-9]+)|x([0-9a-fA-F]+));";

NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex
  usingBlock:^NSString *(NSInteger captureCount,
                         NSString * const capturedStrings[captureCount],
                         const NSRange capturedRanges[captureCount],
                         volatile BOOL * const stop) {
  // Capture 1 holds decimal digits, capture 2 holds hex digits; only one of them matches.
  BOOL        hexValue        = (capturedRanges[1].location == NSNotFound) ? YES : NO;
  NSString   *valueString     = (hexValue == NO) ? capturedStrings[1] : capturedStrings[2];
  const char *valueUTF8String = [valueString UTF8String];
  NSUInteger  u16Length       = 0UL, u32_ch = 0UL;
  unichar     u16Buffer[3];

  u32_ch = strtoul(valueUTF8String, NULL, (hexValue == NO) ? 10 : 16);

  // A BMP code point is a single UTF-16 unit; surrogates and out-of-range values become U+FFFD.
  if (u32_ch <= 0xFFFFU)       { u16Buffer[u16Length++] = ((u32_ch >= 0xD800U) && (u32_ch <= 0xDFFFU)) ? 0xFFFDU : u32_ch; }
  else if (u32_ch > 0x10FFFFU) { u16Buffer[u16Length++] = 0xFFFDU; }
  // Everything else is encoded as a surrogate pair.
  else {
    u32_ch -= 0x0010000UL;
    u16Buffer[u16Length++] = ((u32_ch >> 10) + 0xD800U);
    u16Buffer[u16Length++] = ((u32_ch & 0x3FFUL) + 0xDC00U);
  }

  return([NSString stringWithCharacters:u16Buffer length:u16Length]);
}];

NSLog(@"string  :\n%@", string);
NSLog(@"replaced:\n%@", replacedString);

The output when run:

2010-04-19 20:24:47.092 RegexKitLite[65827:a0f] string  :
A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
Even >0xffff are handled: &#119808; or &#x1D400; (0x1d400 == MATHEMATICAL BOLD CAPITAL A)
2010-04-19 20:24:47.093 RegexKitLite[65827:a0f] replaced:
A test: é or é (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
Even >0xffff are handled: 𝐀 or 𝐀 (0x1d400 == MATHEMATICAL BOLD CAPITAL A)
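
The code point to UTF-16 arithmetic used in the example can also be
checked in isolation.  The following plain C sketch mirrors the same
three cases (single unit, U+FFFD substitution, surrogate pair).  The
function name and signature are invented for illustration and are not
part of RegexKitLite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16 into out (room for two units) and
   returns the number of units written.  Surrogate code points and
   values above 0x10FFFF are replaced with U+FFFD. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    assert(out != NULL);
    if (cp <= 0xFFFFu) {
        out[0] = (cp >= 0xD800u && cp <= 0xDFFFu) ? 0xFFFDu : (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFFu) {
        out[0] = 0xFFFDu;
        return 1;
    }
    cp -= 0x10000u;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));      /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

For 0x1D400 (decimal 119808), subtracting 0x10000 leaves 0xD400, whose
top ten bits are 0x35 and bottom ten bits are 0, so the pair is 0xD835
0xDC00.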

This works fine for "numerically encoded" character entities, but not
for "symbolically encoded" characters, such as &amp;, &ge;, &nbsp;,
etc.  With a little bit of effort, these can be handled too.  If speed
is a concern, you could even create an NSDictionary that contains keys
in the form of "&NAME;" which map to an NSString object that contains
the Unicode character that the symbolic name represents.  A regular
expression pattern such as "&[a-zA-Z]+;" could then be used to match
these entities, and the captured string would be used as the key
argument in the NSDictionary method objectForKey:.  Theoretically,
this should be fairly quick and provide nearly O(1) search and replace
time for each symbolically encoded character entity found, regardless
of how many character entities are contained in the NSDictionary.
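
To make the lookup idea concrete without depending on Foundation, here
is a small C sketch that maps a matched "&NAME;" string to its
replacement.  Note the substitution: plain C has no NSDictionary, so
this uses a sorted table searched with bsearch(), which is O(log n)
rather than the near O(1) of a hash-based dictionary.  The table holds
only a handful of entities, and every name in it (Entity, kEntities,
entity_lookup) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *name;   /* entity exactly as matched, e.g. "&amp;" */
    const char *utf8;   /* replacement text, UTF-8 encoded */
} Entity;

/* Tiny illustrative subset.  Must stay sorted by name in strcmp()
   order, because bsearch() requires a sorted array. */
static const Entity kEntities[] = {
    { "&amp;",  "&" },
    { "&ge;",   "\xE2\x89\xA5" },   /* U+2265 GREATER-THAN OR EQUAL TO */
    { "&gt;",   ">" },
    { "&lt;",   "<" },
    { "&nbsp;", "\xC2\xA0" },       /* U+00A0 NO-BREAK SPACE */
};

/* bsearch() comparator: key is a C string, elem is an Entity. */
static int entity_cmp(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const Entity *)elem)->name);
}

/* Returns the replacement for a matched entity, or NULL if unknown. */
static const char *entity_lookup(const char *matched) {
    assert(matched != NULL);
    const Entity *e = bsearch(matched, kEntities,
                              sizeof kEntities / sizeof kEntities[0],
                              sizeof kEntities[0], entity_cmp);
    return e ? e->utf8 : NULL;
}
```

When the lookup returns NULL, the sensible thing for a replacement
block to do is probably to return the matched text unchanged, so that
unknown entities pass through untouched.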

Blocks-based search and replace opens up all kinds of new
possibilities.  Another example would be an "automagic" http:// URL
shortener: each http:// URL matched by a regex is passed to a ^block,
which contacts a URL shortening service and returns the shortened URL
as the replacement for the original.