Sherman wrote:

> Thanks for the detailed and excellent "reality check". While I'm still
> going through all the details it appears that the fact the current
> Java Unicode property data does not include the properties defined in
> PropList.txt (current implementation reads the property data only from
> UnicodeData.txt, Scripts, Blocks and SpecialCasing.txt) contributes
> to lots of issues raised, which means property data of
> Other_Alphabetic/Lowercase/Uppercase and White_Space are not available
> for j.u.regex and j.l.Character.

Ahah, *finally* I begin to understand! If you are reading property
values from those four files alone, that now explains why most Unicode
properties are missing in Java regexes. *That's* why it always seems
that Java only supports the UCD 3.0 properties, plus Blocks and now
Scripts added since then.

I suspect no one has ever taken a *really* good look at tr18, tr44, and
the layout of the UCD since j.u.regex was first written. Could this
perhaps be possible? It would explain much. I'm sure you looked at the
script stuff, but there's a lot more happening in the UCD these days
than there was back when j.u.regex was written.

Unicode 6.0.0 has 112 properties but as far as I can tell, j.u.regex
supports only 3 of those: General_Category, Script, and Block. Not all
112 are critical, or even generally useful, but at least four of them
*are* on tr18's RL1.2 list of required properties, plus a few more if
you count RL1.2a. So it is very important that they be there.

PropList.txt governs the following properties, not all of which are
binary:

    ASCII_Hex_Digit       Join_Control                        Other_Uppercase
    Bidi_Control          Logical_Order_Exception             Pattern_Syntax
    Dash                  Noncharacter_Code_Point             Pattern_White_Space
    Deprecated            Other_Alphabetic                    Quotation_Mark
    Diacritic             Other_Default_Ignorable_Code_Point  Soft_Dotted
    Extender              Other_Grapheme_Extend               STerm
    Hyphen                Other_ID_Continue                   Terminal_Punctuation
    Ideographic           Other_ID_Start                      Unified_Ideograph
    IDS_Binary_Operator   Other_Lowercase                     Variation_Selector
    IDS_Trinary_Operator  Other_Math                          White_Space

Some of those properties listed above are then used to help establish
these from DerivedCoreProperties.txt:

    Alphabetic                Changes_When_Lowercased       Grapheme_Extend  Math
    Cased                     Changes_When_Titlecased       Grapheme_Link    Uppercase
    Case_Ignorable            Changes_When_Uppercased       ID_Continue      XID_Continue
    Changes_When_Casefolded   Default_Ignorable_Code_Point  ID_Start         XID_Start
    Changes_When_Casemapped   Grapheme_Base                 Lowercase

Not counting the sets of General_Category=XXX and Script=XXX
properties, those properties above probably include the most important
ones--although there are many more.

The PropertyAliases.txt file contains the list of *all* top-level
Unicode property names and their short-cut aliases. There are 112
official properties of Unicode 6.0.0, and many of these are populated
using files other than the 4 that you mention. I include the list of
these in their longest aliases at the bottom of this message.

The reason Perl handles all official Unicode properties is that it
employs a very elaborate build system that generates not only all the
tables needed, but also documentation and test cases. To set up its
property tables, at build time Perl processes all of the *.txt,
extracted/*.txt, and auxiliary/*.txt files from the Unicode Character
Database. These are in the lib/unicore/ subdirectory of Perl's
top-level source directory if you're interested. It does this using the
mktables script, also located in lib/unicore. Perl's build ignores
provisional-only properties so people don't get used to something that
may go away, but handles all the rest of them.

The mktables program is large and fairly complex, although well
structured into a set of co-operating packages and classes (all in the
same file!) and very well documented. I include at the end of this
message an excerpt of the internal documentation from mktables that
explains its overall approach.

> j.u.regex is trying the "closest" possible set for the alphabetic,
> lower/uppercase,

I see now: you just don't have any better data available to you for
this. There is little you can do about that until that data becomes
available to/from Java. Once it does though, the rest should follow
pretty directly. But it's not at all a small issue that's easily
patched up. It will require some serious design and testing. It would
be a good goal for JDK8, I think.

> I will file a RFE to trace this issue.

Thank you very very much.

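To give a feel for what reading PropList.txt involves, here is a minimal Python sketch. It is not how Perl's mktables or the JDK does it; the function names `parse_ranges` and `has_property` are invented for illustration, and `SAMPLE` is a handful of lines in the file's "range ; property # comment" format rather than the real file.

```python
from collections import defaultdict

# A few lines in the style of PropList.txt (illustrative sample only).
SAMPLE = """\
# code point or range ; property name # comment
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>

02B0..02B8    ; Other_Lowercase # Lm   [9] MODIFIER LETTER SMALL H..MODIFIER LETTER SMALL Y
"""


def parse_ranges(text):
    """Map each property name to a list of inclusive (first, last) ranges."""
    props = defaultdict(list)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank or comment-only line
        span, name = (field.strip() for field in data.split(";", 1))
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point if no ".."
        props[name].append((lo, hi))
    return props


def has_property(props, name, ch):
    """True if the character falls in any range recorded for the property."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in props.get(name, ()))


props = parse_ranges(SAMPLE)
```

With the sample above, `has_property(props, "White_Space", "\n")` is true and `has_property(props, "White_Space", "A")` is false. A real implementation would compile the ranges into a compact lookup table rather than scanning a list.
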
I will answer the other half of your message, the part about RL1.4,
later on today.

Hope this helps!

--tom

Here are all 112 official Unicode 6.0.0 properties. Some are intended
to be "internal only" because they are used to generate other
higher-level properties (like Other_Alphabetic used to help generate
Alphabetic), while a few have been deprecated (like the legacy binary
Hyphen property replaced by the more fine-grained Word_Break=XXX
properties):

    Age                           General_Category         Other_Alphabetic
    Alphabetic                    Grapheme_Base            Other_Default_Ignorable_Code_Point
    ASCII_Hex_Digit               Grapheme_Cluster_Break   Other_Grapheme_Extend
    Bidi_Class                    Grapheme_Extend          Other_ID_Continue
    Bidi_Control                  Grapheme_Link            Other_ID_Start
    Bidi_Mirrored                 Hangul_Syllable_Type     Other_Lowercase
    Bidi_Mirroring_Glyph          Hex_Digit                Other_Math
    Block                         Hyphen                   Other_Uppercase
    Canonical_Combining_Class     ID_Continue              Pattern_Syntax
    Cased                         Ideographic              Pattern_White_Space
    Case_Folding                  IDS_Binary_Operator      Quotation_Mark
    Case_Ignorable                ID_Start                 Radical
    Changes_When_Casefolded       IDS_Trinary_Operator     Script
    Changes_When_Casemapped       ISO_Comment              Sentence_Break
    Changes_When_Lowercased       Jamo_Short_Name          Simple_Case_Folding
    Changes_When_NFKC_Casefolded  Join_Control             Simple_Lowercase_Mapping
    Changes_When_Titlecased       Joining_Group            Simple_Titlecase_Mapping
    Changes_When_Uppercased       Joining_Type             Simple_Uppercase_Mapping
    Composition_Exclusion         Line_Break               Soft_Dotted
    Dash                          Logical_Order_Exception  STerm
    Decomposition_Mapping         Lowercase                Terminal_Punctuation
    Decomposition_Type            Lowercase_Mapping        Titlecase_Mapping
    Default_Ignorable_Code_Point  Math                     Unicode_1_Name
    Deprecated                    Name                     Unicode_Radical_Stroke
    Diacritic                     Name_Alias               Unified_Ideograph
    East_Asian_Width              NFC_Quick_Check          Uppercase
    Expands_On_NFC                NFD_Quick_Check          Uppercase_Mapping
    Expands_On_NFD                NFKC_Casefold            Variation_Selector
    Expands_On_NFKC               NFKC_Quick_Check         White_Space
    Expands_On_NFKD               NFKD_Quick_Check         Word_Break
    Extender                      Noncharacter_Code_Point  XID_Continue
    FC_NFKC_Closure               Numeric_Type             XID_Start
    Full_Composition_Exclusion    Numeric_Value

The guts of the mktables program's algorithm are explained here:

# This program works on all non-provisional properties as of 6.0, though the
# files for some are suppressed from apparent lack of demand for them. You
# can change which are output by changing lists in this program.
#
# The old version of mktables emphasized the term "Fuzzy" to mean Unicode's
# loose matchings rules (from Unicode TR18):
#
#    The recommended names for UCD properties and property values are in
#    PropertyAliases.txt [Prop] and PropertyValueAliases.txt
#    [PropValue]. There are both abbreviated names and longer, more
#    descriptive names. It is strongly recommended that both names be
#    recognized, and that loose matching of property names be used,
#    whereby the case distinctions, whitespace, hyphens, and underbar
#    are ignored.
#
# The program still allows Fuzzy to override its determination of if loose
# matching should be used, but it isn't currently used, as it is no longer
# needed; the calculations it makes are good enough.
#
# SUMMARY OF HOW IT WORKS:
#
# Each file on the list is processed in a loop, using the associated handler
# code for each:
#     The PropertyAliases.txt and PropValueAliases.txt files are processed
#         first. These files name the properties and property values.
#         Objects are created of all the property and property value names
#         that the rest of the input should expect, including all synonyms.
#     The other input files give mappings from properties to property
#         values. That is, they list code points and say what the mapping
#         is under the given property. Some files give the mappings for
#         just one property; and some for many. This program goes through
#         each file and populates the properties from them. Some properties
#         are listed in more than one file, and Unicode has set up a
#         precedence as to which has priority if there is a conflict.
#         Thus the order of processing matters, and this program handles
#         the conflict possibility by processing the overriding input
#         files last, so that if necessary they replace earlier values.
#     After this is all done, the program creates the property mappings not
#         furnished by Unicode, but derivable from what it does give.
#     The tables of code points that match each property value in each
#         property that is accessible by regular expressions are created.
#     The Perl-defined properties are created and populated. Many of these
#         require data determined from the earlier steps.
#     Any Perl-defined synonyms are created, and name clashes between Perl
#         and Unicode are reconciled and warned about.
#     All the properties are written to files.
#     Any other files are written, and final warnings issued.
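
The loose-matching rule quoted from TR18 in the excerpt above (ignore case, whitespace, hyphens, and underscores when comparing property names) can be sketched in a few lines of Python. This is only an illustration of the idea, not mktables' code: `loose_key`, `build_lookup`, and `resolve` are hypothetical names, and `SAMPLE_ALIASES` is a small hand-copied subset of long names and aliases, where a real table would be generated from PropertyAliases.txt.

```python
import re

# Hand-copied subset of long names and their aliases (illustrative only).
SAMPLE_ALIASES = {
    "General_Category": ["gc"],
    "Script": ["sc"],
    "Block": ["blk"],
    "White_Space": ["WSpace", "space"],
    "Alphabetic": ["Alpha"],
}


def loose_key(name):
    """Fold away case, whitespace, hyphens, and underscores."""
    return re.sub(r"[\s_-]+", "", name).lower()


def build_lookup(aliases):
    """Index every long name and alias by its loose key."""
    lookup = {}
    for long_name, short_names in aliases.items():
        for name in [long_name, *short_names]:
            lookup[loose_key(name)] = long_name
    return lookup


def resolve(lookup, name):
    """Return the canonical long name, or None if the name is unknown."""
    return lookup.get(loose_key(name))


LOOKUP = build_lookup(SAMPLE_ALIASES)
```

With this table, "white space", "WHITE-SPACE", and "wspace" all resolve to White_Space, while an unknown name resolves to None. Normalizing once at table-build time and once per query keeps the comparison itself a plain dictionary lookup.
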