Java does not meet the requirement of RL1.2. It provides only 3 of the 11 required properties; it omits 4 altogether and implements 4 others in a fashion contrary to the standard. Java also neglects the strongly recommended aspects of this section, which is quite a pity.

From tr18:

    RL1.2 Properties

    To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:

        General_Category
        Script
        Alphabetic
        Uppercase
        Lowercase
        White_Space
        Noncharacter_Code_Point
        Default_Ignorable_Code_Point
        ANY
        ASCII
        ASSIGNED

Of those listed above as *shall provide*, Java provides these three:

    + The ASCII property.
    + The General_Categories like \p{Lu}, although only in their short forms; it does not provide the long forms.
    + The Script categories like \p{Greek}, a very *VERY* welcome addition for Unicode 6.0.

Java does not provide these four required properties:

    - Noncharacter_Code_Point
    - Default_Ignorable_Code_Point
    - ANY
    - ASSIGNED

The worst part is that Java gives non-Unicode meanings to these four Unicode properties (I'll give details on these lapses in a separate message):

    * Alphabetic
    * Uppercase
    * Lowercase
    * White_Space

I would like to see everything listed above addressed, and I do not understand how you can claim Level 1 conformance without doing so.

There are also "strongly recommended" things that you do not implement, like loose matching of property names. That would not cost you much, I feel.

tr18's section 1.2 also lists several "recommended" properties, not all of which are binary. Properties that are not absolutely required for compliance with RL1.2, but which I find especially useful, include these binary properties:

    \p{Dash}
    \p{Quotation_Mark}
    \p{Diacritic}
    \p{Math}

If you are going to do \X for extended grapheme clusters instead of legacy grapheme clusters, then you will need access to Hangul Syllable Types, which is not a binary property.

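
The inventory above can be rechecked against whatever JDK is at hand by asking the regex compiler which property names it accepts. What follows is only a minimal sketch: the class name `PropertyProbe`, the helper `compiles`, and the particular spellings probed are illustrative assumptions, not anything taken from tr18 or the Java documentation, and the answers will vary with the Java release it is run on.

```java
import java.util.regex.Pattern;

final class PropertyProbe {

    /**
     * Returns true if this JDK's regex compiler accepts the pattern.
     * PatternSyntaxException is a subclass of IllegalArgumentException,
     * so catching the parent covers an unknown property name either way.
     */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Spellings to try; the list is illustrative, not exhaustive.
        String[] probes = {
            "\\p{Lu}",                  // General_Category, short form
            "\\p{Uppercase_Letter}",    // General_Category, long form
            "\\p{lu}",                  // loose matching of the short form
            "\\p{Greek}",               // Script, bare name
            "\\p{IsGreek}",             // Script, Is-prefixed name
            "\\p{ASCII}",
            "\\p{Alphabetic}",
            "\\p{Uppercase}",
            "\\p{Lowercase}",
            "\\p{White_Space}",
            "\\p{Noncharacter_Code_Point}",
            "\\p{Default_Ignorable_Code_Point}",
            "\\p{Any}",
            "\\p{Assigned}",
            "\\p{Dash}",
            "\\p{Quotation_Mark}",
        };
        for (String probe : probes) {
            System.out.printf("%-36s %s%n", probe,
                    compiles(probe) ? "accepted" : "rejected");
        }
    }
}
```

Note that "accepted" shows only that the name is recognized. It says nothing about whether the set of code points matched agrees with the Unicode definition of that property, which is a separate question.
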
The best place to read up on the full set of UCD properties is at

    http://www.unicode.org/reports/tr44/tr44-4.html#Properties

There are several tables of properties there; at the top of the file, though, it says:

    1 Introduction

    The Unicode Standard is far more than a simple encoding of characters. The standard also associates a rich set of semantics with each encoded character--properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names. The data files define the Unicode character properties and mappings between Unicode characters (such as case mappings).

That shows how important properties are. The conformance document also includes this statement:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    Interpretation of characters is a more complex issue for the Unicode Standard. It includes the core issue of interpreting code points used as characters according to the names and representative glyphs shown in the code charts, of course. However, the Unicode Standard also specifies character properties, behavior, and interactions between characters. Such information about characters is considered an integral part of the "character semantics established by this standard." Information about the properties, behavior, and interactions between Unicode characters is provided in the Unicode Character Database and in the Unicode Standard Annexes.

That again stresses the importance of properties and interactions between characters. That Java gives properties the same names that Unicode does, but gives them behaviours that are something else entirely, is particularly vexing. I cannot see how that is conformant, either. You have to do what they say you have to do with the property names they give you.
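
One way to test whether a given name behaves as Unicode defines it is to count the code points the pattern matches and compare that figure with the one derived from the UCD files. The following is a rough sketch only: the class `PropertyCount`, its method `countMatches`, and the four sample patterns are illustrative assumptions of mine, and \p{Lower} and \p{Space} are merely stand-ins, not necessarily the spellings at issue here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropertyCount {

    /** Counts the code points that a one-character pattern matches. */
    static int countMatches(String regex) {
        Matcher m = Pattern.compile(regex).matcher("");
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Skip surrogate code points: alone they are not well-formed UTF-16.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            if (m.reset(new String(Character.toChars(cp))).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample patterns, chosen for illustration: two POSIX-style names
        // and two General_Category values to set beside them.
        String[] patterns = { "\\p{Lower}", "\\p{Ll}", "\\p{Space}", "\\p{Zs}" };
        for (String p : patterns) {
            System.out.printf("%-12s matches %d code points%n", p, countMatches(p));
        }
    }
}
```

Under the default flags, the java.util.regex.Pattern documentation defines \p{Lower} as [a-z], so its count should come to 26, far short of what the Unicode Lowercase property covers. The counts for \p{Ll} and \p{Zs} depend on the Unicode version the JDK ships with.
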
If you want your own behaviours, you can choose different property names. But theirs are reserved to behave as they define them to behave.

I will therefore address the errors I believe Java makes in the Alphabetic, Uppercase, Lowercase, and White_Space properties in my next message, part 2 of RL1.2 Properties.

--tom