RegexKitLite 4.0 has been released.

Links:

Download: http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.tar.bz2 (139.1K)
Documentation: http://regexkit.sourceforge.net/RegexKitLite/index.html
PDF Documentation: http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.pdf (1.1M)

On a personal note, please remember that RegexKitLite is open source software distributed under the BSD License. This means that you are required to acknowledge your use of RegexKitLite in your application. There are an awful lot of "Top 10" applications that use RegexKitLite but don't acknowledge their use of RegexKitLite (or, at least, no acknowledgement can be easily found, even when one makes an effort to find it). At this point, I really don't have a problem starting a "Public Shaming" list of applications and companies that are non-compliant with the licensing terms.

It's free. It costs you nothing. Many people have asked for an "exception" to strict compliance with the BSD License, so that they can include just a simple, one line "acknowledgement" in their app instead of the complete text of the BSD license, and I have granted it. I doubt many people would have a whole lot of sympathy for a company that has just hit the jackpot by being bought by a large "social network" company because of its iPhone app, only to have the app yanked from the App Store due to licensing non-compliance over free, open-source software, whose only "cost" is a simple acknowledgement.

RegexKitLite 4.0 is a major release that includes new features, new APIs, and bug fixes.

Highlights of what's new in the 4.0 release:

PDF Documentation

Considerable effort was put into creating high quality PDF documentation. The style of the documentation is very similar to the Apple ADC produced documentation that you are already familiar with. Overall, I'm very happy with the final results.

Improved Performance

Previously, the regular expression cache worked as a "1-way set associative" cache, essentially '[regexString hash] % cacheSlots'. RegexKitLite 4.0 uses a 4-way set associative cache with a genuine least recently used replacement policy.
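
To make the cache layout concrete, here is a rough sketch in plain C (which also compiles as Objective-C) of a 4-way set associative cache with least recently used replacement. This is an illustration only and is not RegexKitLite's code: the names Cache, Way, cache_lookup and cache_insert, the choice of 7 sets, and the int payload standing in for a compiled regular expression are all assumptions. It tracks recency with a simple per-entry counter, which is the easiest thing to get right and not necessarily what RegexKitLite does, and it compares hashes only, where a real cache would also have to compare the regex strings themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS 7   /* hypothetical number of sets ("slots") */
#define NUM_WAYS 4   /* entries that may share one set */

typedef struct {
  uint32_t hash;      /* hash of the regex string (stand-in for the full key) */
  int      valid;     /* 0 while this way is empty */
  uint64_t lastUsed;  /* value of the cache clock at the most recent use */
  int      value;     /* stand-in for the compiled regular expression */
} Way;

typedef struct {
  Way      sets[NUM_SETS][NUM_WAYS];
  uint64_t clock;     /* incremented on every hit and every insert */
} Cache;

/* Returns the matching entry and marks it most recently used, or NULL on a miss. */
static Way *cache_lookup(Cache *c, uint32_t hash) {
  assert(c != NULL);
  Way *set = c->sets[hash % NUM_SETS];
  for (int w = 0; w < NUM_WAYS; w++) {
    if (set[w].valid && set[w].hash == hash) {
      set[w].lastUsed = ++c->clock;
      return &set[w];
    }
  }
  return NULL;
}

/* Stores a new entry in the set selected by the hash. An empty way is used if
   one exists; otherwise the least recently used way in that set is replaced. */
static Way *cache_insert(Cache *c, uint32_t hash, int value) {
  assert(c != NULL);
  Way *set = c->sets[hash % NUM_SETS];
  Way *victim = &set[0];
  for (int w = 0; w < NUM_WAYS; w++) {
    if (!set[w].valid) { victim = &set[w]; break; }
    if (set[w].lastUsed < victim->lastUsed) { victim = &set[w]; }
  }
  victim->hash     = hash;
  victim->valid    = 1;
  victim->value    = value;
  victim->lastUsed = ++c->clock;
  return victim;
}
```

With a zero-initialized Cache, inserting the hashes 94, 122, 101 and 108 (all congruent to 3 mod 7) fills one set. Looking up 94 and then inserting 115 replaces the entry for 122, because 122 is then the least recently used way in that set. In the old 1-way scheme each of these inserts would have evicted the previous one.
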
This means that strings with congruent hashes mod the number of cache slots (e.g., with 7 slots, the hashes 94 and 122, since 94 % 7 == 3 && 122 % 7 == 3) are part of the same set and can now be placed in one of four "ways" in that set. When a set is full, the way in that set that was the least recently used is the one that is picked to be ejected from the cache and filled with the new entry. This change /dramatically/ increases the odds that a previously used regular expression will be found in the compiled regular expression cache.

The LRU algorithm chosen is also particularly clever: updating a "way" so that it is the most recently used is completely branchless and compiles to about a dozen instructions. This means that on a modern out of order super scalar CPU, such as the Intel Core 2, true LRU tracking is probably "free", or very close to it, where "free" means there are no cycles used exclusively to perform LRU work.

As to why it is important to have a cache of compiled regular expressions, benchmarking showed that compiling a regular expression took ~27,560 ns (on average, based on ~25 different regular expressions), whereas retrieving a cached compiled regular expression took ~51 ns. The time to retrieve a cached compiled regular expression is constant, regardless of the regular expression's complexity or the time it takes to actually compile the regular expression. This means that it is ~541.3 times faster to retrieve a cached compiled regular expression than it is to actually compile it. In terms of operations per second, this means it is possible to perform 19,680,762 cache look ups per second, compared to 36,285 regular expression compiles per second. Timing done on a MacBook Pro @ 2.66GHz Core 2 on Mac OS X 10.6.2.

The ICU regular expression library also requires that the text to be searched be encoded in UTF-16 format. How an NSString (or CFString) chooses to encode the contents of the string is an "implementation issue", and normally not something a programmer should worry about.
This is one of those times where it can make a difference. The short version is this: Sometimes a Cocoa / Foundation / CoreFoundation string object uses UTF-16 internally, in which case RegexKitLite takes advantage of that fact and uses that buffer directly, an obvious performance win. On the other hand, sometimes it doesn't, so RegexKitLite needs to convert the string's contents into a UTF-16 encoded buffer. RegexKitLite also caches these conversions, and RegexKitLite 4.0 now uses the same LRU algorithm to manage these conversions as well.

"Officially Supported" In iPhone OS >= 3.2

The iPhone OS 3.2 SDK introduced official support for ICU based regular expressions, and for linking to the libicucore ICU dynamic library for the purpose of using its regular expression engine. Many developers were concerned as to whether or not the use of RegexKitLite represented a violation of the iPhone SDK Agreement clause prohibiting the use of private or undocumented APIs. As of iPhone OS 3.2, this is no longer a concern. See the RegexKitLite documentation for more information and links to Apple's documentation about this change.

While iPhone OS 3.2 introduced official support for ICU based regular expressions, at this time it is unclear if the iPhone 4.0 SDK Agreement "3.3.1" clause change prohibits the use of RegexKitLite in iPhone OS 4.0. Technically, RegexKitLite is a "compatibility layer" between iPhoneOS/Cocoa/Foundation/NSString and the (now) "Documented APIs" of ICU regular expressions.

---- begin disclaimer ----

It is not my intention to start a religious war on the merits of the 3.3.1 change, or whether or not the change to 3.3.1 was intended to cover things like RegexKitLite.

Anyone who believes that 3.3.1 does not cover RegexKitLite is simply kidding themselves. I personally have no doubt whatsoever that the wording of 3.3.1 covers RegexKitLite.
The wording of the clause does not make allowances for the fact that something like RegexKitLite is supplied to you in "source code format" that you compile, or that the source code is written in Objective-C. The wording /is/ explicit that whatever is being considered need only act as a "compatibility layer" between the application and the "documented APIs", which is exactly what RegexKitLite does. QED.

Note that this is a different interpretation of 3.3.1 than that taken by many other open source projects, which take the position that because the source code is supplied and is written in one of the approved languages, the 3.3.1 clause does not apply, or which rely on some project-biased interpretation of the definition of "compatibility layer".

My intent is to be brutally honest with those developers who need to decide whether the risk of using RegexKitLite, and the potential consequence of it being grounds for rejection, outweighs the gains and the cost of having to rewrite / re-architect their application without RegexKitLite. This is done under the "hope for the best, plan for the worst" philosophy: the worst case scenario is that 3.3.1 covers RegexKitLite, and you should plan accordingly. Caveat Emptor.

Currently I have no hard, empirical evidence as to whether or not 3.3.1 applies to RegexKitLite to help you, the developer, make a decision one way or the other. I would appreciate reports on whether an application that uses RegexKitLite was accepted or rejected for any version of iPhone OS. This information will be incorporated into the documentation of later versions to help guide the decisions of future developers.

As it currently stands, I am not aware of a single iPhone application that has been rejected due to the use of RegexKitLite, but there are an awful lot of iPhone applications that have been accepted, and this was prior to iPhone OS 3.2.

---- end disclaimer ----

New Features

RegexKitLite 4.0 now supports the new Blocks feature through a number of new methods (these represent the shorter "convenience" form of the methods):

  enumerateStringsMatchedByRegex:usingBlock:
  enumerateStringsSeparatedByRegex:usingBlock:
  stringByReplacingOccurrencesOfRegex:usingBlock:
  replaceOccurrencesOfRegex:usingBlock: (NSMutableString)

The enumerateStringsMatchedByRegex: method invokes the passed ^block argument for each match, whereas enumerateStringsSeparatedByRegex: uses the supplied regular expression to perform a "split" operation on the string and invokes the supplied ^block argument with the split results.

The stringByReplacingOccurrencesOfRegex: and replaceOccurrencesOfRegex: methods perform a "search and replace" operation on the string, invoking the supplied ^block argument with details of each match, and replacing the characters matched with the contents of the NSString returned by the ^block. This is very similar to the previous "search and replace" functionality that allowed you to create a replacement string using the contents of various capture groups using the "$n" syntax (e.g., @"First: $2, Last: $1", where $1 and $2 are capture groups one and two from the regular expression). The difference is that Blocks-based replacement allows complete control over the replacement text instead of the limited "fixed function" capability previously available.

The following is an example of what's possible using the new Blocks-based replacement functionality. A common problem when dealing with HTML text is dealing with "&#NNN;" and "&#xHHH;" encoded characters.
Using Blocks-based search and replace, we can match these sequences using a regular expression and replace them with the actual Unicode character that they represent:

  NSString *string = @"A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)\n"
                     @"Even >0xffff are handled: &#119808; or &#x1d400; (0x1d400 == MATHEMATICAL BOLD CAPITAL A)";
  // Capture 1 is the decimal form, capture 2 is the hexadecimal form.
  NSString *regex = @"&#(?:([0-9]+)|x([0-9a-fA-F]+));";

  NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
    // If capture 1 did not match, the entity was written in hexadecimal.
    BOOL hexValue = (capturedRanges[1].location == NSNotFound) ? YES : NO;
    NSString *valueString = (hexValue == NO) ? capturedStrings[1] : capturedStrings[2];
    const char *valueUTF8String = [valueString UTF8String];
    NSUInteger u16Length = 0UL, u32_ch = 0UL;
    unichar u16Buffer[3];

    u32_ch = strtoul(valueUTF8String, NULL, (hexValue == NO) ? 10 : 16);

    if (u32_ch <= 0xFFFFU) {
      // BMP code point. A lone surrogate value becomes U+FFFD REPLACEMENT CHARACTER.
      u16Buffer[u16Length++] = ((u32_ch >= 0xD800U) && (u32_ch <= 0xDFFFU)) ? 0xFFFDU : u32_ch;
    } else if (u32_ch > 0x10FFFFU) {
      // Beyond the Unicode range.
      u16Buffer[u16Length++] = 0xFFFDU;
    } else {
      // Supplementary code point: encode as a UTF-16 surrogate pair.
      u32_ch -= 0x0010000UL;
      u16Buffer[u16Length++] = ((u32_ch >> 10) + 0xD800U);
      u16Buffer[u16Length++] = ((u32_ch & 0x3FFUL) + 0xDC00U);
    }

    return([NSString stringWithCharacters:u16Buffer length:u16Length]);
  }];

  NSLog(@"string  :\n%@", string);
  NSLog(@"replaced:\n%@", replacedString);

The output when run:

  2010-04-19 20:24:47.092 RegexKitLite[65827:a0f] string  :
  A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
  Even >0xffff are handled: &#119808; or &#x1d400; (0x1d400 == MATHEMATICAL BOLD CAPITAL A)
  2010-04-19 20:24:47.093 RegexKitLite[65827:a0f] replaced:
  A test: é or é (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
  Even >0xffff are handled: 𝐀 or 𝐀 (0x1d400 == MATHEMATICAL BOLD CAPITAL A)

This works fine for "numerically encoded" character entities, but not for "symbolically encoded" characters, such as &amp;, &ge;, &nbsp;, etc.
With a little bit of effort, these can be handled too. If speed is a concern, you could even create an NSDictionary that contains keys in the form of "&NAME;" which return an NSString object that contains the Unicode character that the symbolic name represents. A regular expression pattern such as "&[a-zA-Z][a-zA-Z0-9]*;" could then be used to match these entities, and the matched string would be used as the key argument in the NSDictionary method objectForKey:. Theoretically, this should be fairly quick and provide nearly O(1) search and replace time for each symbolically encoded character entity found, regardless of how many character entities are contained in the NSDictionary.

Blocks-based search and replace opens up all kinds of new possibilities. Another example would be an "automagic" http:// URL shortener: a regex that matches an http:// URL in a string is fed to a ^block, which contacts a URL shortening service and replaces the original URL with the shortened URL.
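
As a rough illustration of the lookup-table idea for symbolic entities, here is a sketch in plain C (which also compiles as Objective-C), written this way so it can be built and run without Cocoa or RegexKitLite. It is not the NSDictionary and Blocks version described above: a sorted table searched with bsearch() stands in for the dictionary, so each lookup is O(log n) rather than nearly O(1), and a small hand-written scanner stands in for the regular expression. The table contents and the names Entity, kEntities, lookup_entity and replace_named_entities are assumptions made up for this example, and the replacement text is UTF-8.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  const char *name;  /* key in the "&NAME;" form */
  const char *utf8;  /* UTF-8 encoding of the character it represents */
} Entity;

/* A few sample entries only. Must stay sorted by name (strcmp order) for bsearch(). */
static const Entity kEntities[] = {
  { "&amp;",  "&" },
  { "&ge;",   "\xE2\x89\xA5" },  /* U+2265 GREATER-THAN OR EQUAL TO */
  { "&gt;",   ">" },
  { "&lt;",   "<" },
  { "&nbsp;", "\xC2\xA0" },      /* U+00A0 NO-BREAK SPACE */
  { "&quot;", "\"" },
};

static int compare_entity(const void *key, const void *element) {
  return strcmp((const char *)key, ((const Entity *)element)->name);
}

/* Returns the replacement text for a "&NAME;" key, or NULL if it is not in the table. */
static const char *lookup_entity(const char *name) {
  const Entity *e = bsearch(name, kEntities, sizeof kEntities / sizeof kEntities[0],
                            sizeof kEntities[0], compare_entity);
  return (e != NULL) ? e->utf8 : NULL;
}

/* Copies src into dst, replacing every known "&NAME;" sequence. Unknown or
   malformed sequences are copied unchanged. Returns 0 on success, or -1 if
   dst (dstSize bytes, including the terminating NUL) is too small. */
static int replace_named_entities(const char *src, char *dst, size_t dstSize) {
  assert(src != NULL && dst != NULL);
  size_t out = 0;
  if (dstSize == 0) { return -1; }
  while (*src != '\0') {
    const char *piece = src;  /* default: copy one byte through unchanged */
    size_t pieceLen = 1, advance = 1;
    char name[16];
    if (src[0] == '&' && isalpha((unsigned char)src[1])) {
      size_t n = 2;           /* index of the first byte after "&" and one letter */
      while (isalnum((unsigned char)src[n])) { n++; }
      if (src[n] == ';' && n + 2 <= sizeof name) {
        memcpy(name, src, n + 1);  /* copy "&NAME;" including both delimiters */
        name[n + 1] = '\0';
        const char *replacement = lookup_entity(name);
        if (replacement != NULL) {
          piece = replacement;
          pieceLen = strlen(replacement);
          advance = n + 1;
        }
      }
    }
    if (out + pieceLen >= dstSize) { return -1; }
    memcpy(dst + out, piece, pieceLen);
    out += pieceLen;
    src += advance;
  }
  dst[out] = '\0';
  return 0;
}
```

In the Blocks-based version, the body of the ^block would play the role of lookup_entity(): look up the matched string in the dictionary and return either the replacement or, for an unknown name, the matched string itself so that the text is left untouched.
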