The JDK7 Pattern documentation says:

    This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

But the very first thing in tr18's conformance section reads:

    C0. An implementation claiming conformance to this specification at any Level shall identify the version of this specification and the version of the Unicode Standard.

What is therefore missing from the JDK7 j.u.r.Pattern documentation is a mandatory pair of concrete version citations: one for the version of tr18 and one for the version of Unicode. Full citation forms can be found at:

    http://www.unicode.org/versions/#References
    http://www.unicode.org/versions/components-6.0.0.html

The versions I have myself been using for these analyses are:

    UTS#18, "Unicode Regular Expressions", version 13 from August 29, 2008.
    http://www.unicode.org/reports/tr18/tr18-13.html

    The Unicode Standard, version 6.0.0 from October 11, 2010.
    http://www.unicode.org/versions/Unicode6.0.0/

I believe that JDK7 is in a functionality freeze. One thing that should still be possible even at this late stage in the JDK7 cycle is to recast the single conformance statement in the documentation into a more fine-grained set of statements corresponding to each of RL1.1 through RL1.7.

This is the approach taken by Perl. Instead of a broad brush, we list in columnar format each of the RL numbers along with our current status toward meeting that requirement, with footnotes giving any needed elaboration. See the section "Unicode Regular Expression Support Level" in the perlunicode manpage for how this looks (preferably in a current release, so Perl 5.12 or better). After my signature I give an example of this from our current release.

I think this is probably the best way to go anyway, but it is clearly the only choice given the demands of sound and stable release engineering. That's because although it may be possible to slip in one or two fixes where there is a clear bug at variance with documented behavior, I do not believe it possible to sneak in the non-trivial changes needed for things like RL1.2.

I also suggest that some thought be given to how to go about implementing full Level 1 conformance in as useful but painless a manner as possible. I have several ideas related to maintaining backwards compatibility while still moving forward. This necessarily requires more deliberation, and is clearly beyond what is allowable under a functionality freeze. But updating the documentation should not be.

--tom

For comparison purposes only, here is Perl's conformance statement from the perlunicode manpage. The footnotes indicate how each requirement is (or is not) met. I include only the Level 1 matters; Levels 2 and 3 are not well-supported at this time, being limited to \X and \N{}; tailoring is available via the Unicode::Collate and Unicode::Collate::Locale classes, and normalization via Unicode::Normalize, but these are not yet integrated into the regular expression system proper.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes all the features currently supported. The references to "Level N" and the section numbers refer to the Unicode Technical Standard #18, "Unicode Regular Expressions", version 11, in May 2005.

Level 1 - Basic Unicode Support

        RL1.1   Hex Notation                     - done          [1]
        RL1.2   Properties                       - done          [2][3]
        RL1.2a  Compatibility Properties         - done          [4]
        RL1.3   Subtraction and Intersection     - MISSING       [5]
        RL1.4   Simple Word Boundaries           - done          [6]
        RL1.5   Simple Loose Matches             - done          [7]
        RL1.6   Line Boundaries                  - MISSING       [8]
        RL1.7   Supplementary Code Points        - done          [9]

{IMPLEMENTATION FOOTNOTES}

        [1]  \x{...}
        [2]  \p{...} \P{...}
        [3]  supports not only minimal list, but all Unicode character
             properties (see L</Unicode Character Properties>)
        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
        [5]  can use regular expression look-ahead [a] or
             user-defined character properties [b] to emulate set
             operations
        [6]  \b \B
        [7]  note that Perl does Full case-folding in matching (but with
             bugs), not Simple: for example U+1F88 is equivalent to
             U+1F00 U+03B9, not with 1F80. This difference matters
             mainly for certain Greek capital letters with certain
             modifiers: the Full case-folding decomposes the letter,
             while the Simple case-folding would map it to a single
             character.
        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
             (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
             (U+2029); should also affect <>, $., and script line
             numbers; should not split lines within CRLF [c] (i.e. there
             is no empty line between \r and \n)
        [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to
             U+10FFFF but also beyond U+10FFFF [d]

{LETTERED FOOTNOTES}

        [a]  You can mimic class subtraction using lookahead.
             For example, what UTS#18 might write as

                 [{Greek}-[{UNASSIGNED}]]

             in Perl can be written as:

                 (?!\p{Unassigned})\p{InGreekAndCoptic}
                 (?=\p{Assigned})\p{InGreekAndCoptic}

             But in this particular example, you probably really want

                 \p{GreekAndCoptic}

             which will match assigned characters known to be part of
             the Greek script.

             Also see the Unicode::Regex::Set module; it does implement
             the full UTS#18 grouping, intersection, union, and removal
             (subtraction) syntax.

        [b]  '+' for union, '-' for removal (set-difference), '&' for
             intersection (see L</"User-Defined Character Properties">)

        [c]  Try the C<:crlf> layer (see L<PerlIO>).

        [d]  U+FFFF will currently generate a warning message if 'utf8'
             warnings are enabled
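
For the Java side of the comparison, the subtraction in footnote [a] can be sketched in java.util.regex terms. This is only an illustration, not a statement about RL1.3 conformance: the class and method names are hypothetical, and it assumes that \p{InGreek} (block) and \p{Cn} (the Unassigned general category) are the right Java spellings of the Perl properties quoted above. The first pattern uses the && intersection syntax from the Pattern documentation; the second mirrors the Perl lookahead idiom.

```java
import java.util.regex.Pattern;

public class GreekAssignedDemo {
    // Intersection form: in the Greek block AND NOT unassigned (category Cn).
    static final Pattern BY_INTERSECTION =
        Pattern.compile("[\\p{InGreek}&&[^\\p{Cn}]]");

    // Lookahead form, mirroring the Perl idiom from footnote [a].
    static final Pattern BY_LOOKAHEAD =
        Pattern.compile("(?!\\p{Cn})\\p{InGreek}");

    // Hypothetical helper: does the whole one-code-point string match p?
    static boolean isAssignedGreek(Pattern p, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        int alpha = 0x03B1;     // GREEK SMALL LETTER ALPHA, assigned
        int reserved = 0x03A2;  // inside the Greek block, but unassigned
        for (Pattern p : new Pattern[] { BY_INTERSECTION, BY_LOOKAHEAD }) {
            System.out.println(p.pattern()
                + "  U+03B1=" + isAssignedGreek(p, alpha)
                + "  U+03A2=" + isAssignedGreek(p, reserved));
        }
    }
}
```

Both forms should accept U+03B1 and reject U+03A2, since U+03A2 lies inside the Greek block but has never been assigned a character.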