Here is a summary of my findings:

    Compliance  Req Num  Description
       ???      RL1.1    Hex Notation
       no       RL1.2    Properties
       no       RL1.2a   Compatibility Properties
       yes      RL1.3    Subtraction and Intersection
       no       RL1.4    Simple Word Boundaries
       yes      RL1.5    Simple Loose Matches
       yes      RL1.6    Line Boundaries
       ???      RL1.7    Code Points

Because there is at least one unmet requirement for Level 1 Unicode Support in regular expressions, Java is not currently Level 1 compliant and so does not provide even the most basic level of functionality needed for working with regexes according to version 6.0 of the Unicode Standard. I have not assessed the work required to allow it to become so.

Notes:

    RL1.1   This is marked questionable because I am of the opinion that the requirement of being able to specify a code point using hex notation, without regard to its internal or external serialized representation, is not met, but Sherman is of the opinion that it is. However, it is low priority and easily remedied through the addition of \x{XXXX}.
    RL1.2   This has many different sorts of problems.
    RL1.2a  This has several problems.
    RL1.4   This does not meet the requirements.
    RL1.7   This is very close, save for the problem of ill-formed UTF-16.

Furthermore, tr18 has exactly two strong recommendations, both of which Java fails to follow.

Strong Recommendation #1:

    The recommended names for UCD properties and property values are in PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue]. There are both abbreviated names and longer, more descriptive names. It is strongly recommended that both names be recognized, and that loose matching of property names be used, whereby the case distinctions, whitespace, hyphens, and underbar are ignored.

Java fails to meet this recommendation in many ways:

    SR1.0: Java does not allow for the loose matching of property names.
    SR1.1: Java does not use the recommended names for UCD properties and values.
    SR1.2: Java omits most of those recommended names.
    SR1.3: Java uses some recommended names contrary to their required definitions.
    SR1.4: Java does not allow both the abbreviated names like \p{Nl} and the longer \p{Letter_Number} version.

The other strong recommendation is this one.

Strong Recommendation #2:

    It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (e.g. in #1). It would thus be shorthand for:

        ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )

For Java to legitimately claim Level 1 compliance with Unicode 6.0 according to tr18, it must at a bare minimum correct all the "no" compliance categories to "yes". Without that, the claim is false.

For Java to be *useful* for processing Unicode text, it should go beyond these barest of minima. A good starting point in that direction would be to finally satisfy SR#1 and SR#2 above.

I would also like to see the two "???" matters cleared up, because I believe the intention and pre-existing belief is that they *do* work. Or at least, that they should--bugs notwithstanding.

Java also claims it meets RL2.1 on Canonical Compatibility. This is another area where I believe the intention and pre-existing belief are that it meets that requirement, but where edge-case bugs get in the way of doing so.

I hope this finally answers your question about why I don't believe Java's regexes meet Level 1 requirements, the minimal functionality needed for handling Unicode text in regular expressions per tr18.

To end on a positive note, I am very much looking forward to \X working for grapheme clusters, and very preferably for extended grapheme clusters, not merely legacy ones.

--tom
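
P.S. For anyone who needs the behavior of SR#2 on a Java version whose Pattern class has no such shorthand, here is a minimal sketch that spells out the expansion quoted above by hand. The class name LineBreakSketch and the method splitLines are made up for illustration and are not part of any library. The only API assumed is java.util.regex.Pattern and its \uhhhh escape.

```java
import java.util.regex.Pattern;

final class LineBreakSketch {
    // Hand-written expansion of the line-ending shorthand proposed in tr18.
    // The CR LF pair is listed first so that it is consumed as a single
    // line ending rather than as two separate ones.
    static final Pattern LINE_ENDING = Pattern.compile(
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])");

    // Splits text on any line ending; the -1 limit keeps trailing empty
    // strings, so a final line ending yields a final empty element.
    static String[] splitLines(String text) {
        return LINE_ENDING.split(text, -1);
    }

    public static void main(String[] args) {
        String sample = "one\r\ntwo\u2028three\u0085four";
        for (String line : splitLines(sample)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The alternation order is the one design point that matters: if the character class came first, a CR LF pair would be matched as two line endings and produce a spurious empty line between them.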