> I have filed CR/RFE 7036910:
> j.l.Character.toLowerCaseCharArray/toTitleCaseCharArray for this request.

Thanks very much.

> The j.l.Character.toLowerCase/toUpperCase() suggests to use
> String.toLower/UpperCase() for case mapping, if you want 1:M mapping
> taken care. And if you trust the API:-), which you should in this
> case, you will find that String.toLowerCase/toUpperCase() do handle
> 1:M correctly.
>
> Yes, we don't have a toLowerCaseCharArray() in j.l.c, however, as you
> noticed that there is ONLY one 1:M case mapping for toLowerCase, at
> least for now, and our String.toLowerCase() implementation
> "hardcodeds" that u+0130 as the special case.

Ahah, good.  I had a feeling I should have looked at the String source.

> That said, I yet to dig out the history of toUpperCaseCharArray... and
> I agree, from API design point of view, it would be more nature to
> have the pair.

Well, the thing that seems to me to be more missing is
toTitleCaseCharArray, since the need for it would be more apt to come
up.  Right now you can't get at the full casemapping for titlecase from
Java, and you do sometimes need it.

It's harder to come up with reasonable demos in Latin than in Greek,
since mostly in Latin we have the ff/fi/ffl/ffi ligatures, whereas in
Greek there are lots of examples where you need full titlecasing, not
simple.  Here's one:

    lower: ᾲ στο διάολο
    lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

    title: Ὰͅ Στο Διάολο
    title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

    upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
    upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}

That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but it goes
to U+1FBA U+0345 in titlecase.
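
To make the 1:1 versus 1:M distinction concrete on the Java side, here is a rough sketch that compares the per-character methods with the String methods. The class name FullCaseMapping and the hex() helper are made up for illustration, and Locale.ROOT is an assumption chosen only to keep locale-specific rules (Turkish, Lithuanian) out of the picture.

```java
import java.util.Locale;

// Hypothetical demo class; not part of any JDK API.
class FullCaseMapping {
    /** Formats each code point of s as U+XXXX, separated by spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simple (1:1) mapping: a single code point in, a single code point out.
        System.out.println(hex(String.valueOf(Character.toUpperCase('\u00DF'))));

        // Full (1:M) mapping through String: U+00DF becomes "SS".
        System.out.println(hex("\u00DF".toUpperCase(Locale.ROOT)));

        // The one 1:M lowercase case mentioned above: U+0130 becomes U+0069 U+0307.
        System.out.println(hex("\u0130".toLowerCase(Locale.ROOT)));

        // Full uppercase of the Greek example from the message.
        System.out.println(hex("\u1FB2".toUpperCase(Locale.ROOT)));

        // Only the simple titlecase mapping is reachable from public API;
        // there is no String-level or char[]-returning titlecase method to call.
        System.out.println(hex(new String(Character.toChars(Character.toTitleCase(0x1FB2)))));
    }
}
```

The last line is the gap being discussed: the full titlecase mapping (U+1FBA U+0345) has no public entry point to compare against.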

The lowercase

    "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"

becomes this two-codepoint sequence in uppercase:

    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"

but becomes this two-codepoint sequence in titlecase:

    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"

That's why U+0345 COMBINING GREEK YPOGEGRAMMENI is a \p{Lowercase}
code point, despite its being \p{GC=Mn}.

It's this kind of thing that set me to fixing the j.l.Character
documentation, because isLowerCase and isUpperCase and such were
misstating what they did.  All they do is test for \p{GC=Ll} and
\p{GC=Lu} respectively; they do not actually test for \p{Lowercase} and
\p{Uppercase}, which are binary properties that work on more than just
letters.

I don't know what you can do with the API.  If one had it to do over
again, it would be clearly preferable to distinguish

    isLowerCaseLetter vs isLowerCase
    isUpperCaseLetter vs isUpperCase

where the latter is the full test and the former is only for letters.
But you of course can't do that now, so you're stuck with the existing
names.

I've tried to think of an alternate name for \p{Lowercase} that allows
you to stick with the existing naming (which of course is absolutely
mandatory).  The problem is that it fails the Huffman encoding principle
of making the shorter thing the more commonly used variant, but I can't
see a way around that, given what the existing names already mean:

    isLowerCase             \p{GC=Ll}       \p{Lowercase_Letter}
    isUpperCase             \p{GC=Lu}       \p{Uppercase_Letter}

I think people will go nuts if they have to type this:

    isLowerCaseCodePoint    \p{Lowercase}   \p{Lower}
    isUpperCaseCodePoint    \p{Uppercase}   \p{Upper}

I suppose you might be able to do this:

    isLower                 \p{Lowercase}   \p{Lower}
    isUpper                 \p{Uppercase}   \p{Upper}

Note that I'm using the official Unicode property names there, because
PropertyAliases.txt defines

    Lower ; Lowercase
    Upper ; Uppercase

And of course Uppercase and Lowercase are the properties that work for
all code points, not just Letters.
That is, they're the non-GC versions from:

    http://www.unicode.org/reports/tr44/#Property_Index

The \p{upper} and \p{lower} style is also what tr18's RL1.2a uses for
compatibility properties: those are in lines 2 and 3 of the compat
table.

    http://unicode.org/reports/tr18/#Compatibility_Properties

> Yes, we do have a RFE 6423415: (str) Add String.toTitleCase()
> But given the nature of "title case", the String#toTitleCase() might
> not be what you would like it to be. It would be strange if
> String#toTitleCase() does the similar thing the
> String.toLower/UpperCase() do, in which it title-case-maps all
> characters inside the String, most people probably would expect it
> only title-case-map the first character of the "title string". RFE
> 6423415 has very low priority for now.

> It might be more reasonable to have j.l.Character.toTitleCaseCharArray()
> instead of j.l.String.toTitleCase().

Yes, I think you're right.  In fact, Perl does not provide a function
that will titlecase *all* of a string (although you can always write a
loop).  We only have a function to titlecase the string's first code
point, called for compatibility reasons ucfirst() and available as the
"\u" string escape.  That is, "\u$a" compiles to ucfirst($a).

For the whole string, we use uc() (or \U), which uppercases, not
titlecases.  And of course lc (or \L) lowercases the whole string,
although lcfirst (\l) just does the first character.

That means to generate the strings above, I wrote

    s/(\w+)/\u\L$1/g;

which as a code expression instead of string interpolation would be

    s/(\w+)/ucfirst(lc($1))/ge;

That's a bit cavalier, of course, since \w grabs more than just things
that change case.  However, you can't write:

    s/(\pL+)/\u\L$1/g;

because that misses the nonletters.  The \p{Alphabetic} property
should, I believe, work for this.

    s/(\p{alpha}+)/\u\L$1/g;

That works because Perl uses the entry from PropertyAliases.txt:

    Alpha ; Alphabetic

which is also the RL1.2a guideline, even in POSIX compat mode.
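
For comparison, here is roughly what the s/(\p{alpha}+)/\u\L$1/g substitution looks like in Java, using only the simple titlecase mapping that Character exposes. This is a sketch under assumptions: the class name SimpleTitlecase and method titlecaseWords are invented for illustration, and it relies on the \p{IsAlphabetic} binary property, which is available in java.util.regex from JDK 7 onward.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; mirrors the Perl one-liner with simple mappings only.
class SimpleTitlecase {
    // Runs of code points that have the Unicode Alphabetic property.
    private static final Pattern WORD = Pattern.compile("\\p{IsAlphabetic}+");

    /** Lowercases each alphabetic run, then titlecases its first code point. */
    static String titlecaseWords(String text) {
        Matcher m = WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the non-word text between matches unchanged.
            out.append(text, last, m.start());
            // Equivalent of \L: full lowercase mapping of the whole run.
            String word = m.group().toLowerCase(Locale.ROOT);
            // Equivalent of \u, but limited to the simple 1:1 titlecase mapping.
            int first = word.codePointAt(0);
            out.appendCodePoint(Character.toTitleCase(first));
            out.append(word, Character.charCount(first), word.length());
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titlecaseWords("\u03C3\u03C4\u03BF \u03B4\u03B9\u03AC\u03BF\u03BB\u03BF"));
        System.out.println(titlecaseWords("hello WORLD"));
    }
}
```

Because Character.toTitleCase(int) returns a single code point, a first character whose titlecase form exists only as a full mapping cannot come out right here, which is the case a toTitleCaseCharArray would cover.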

But because we have access to all Unicode properties, there are
arguably more appropriate ones, like \p{Cased} -- except that doesn't
guarantee that the thing will change (in case you are).  These however
do (from PropertyAliases.txt):

    CWCF  ; Changes_When_Casefolded
    CWCM  ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL   ; Changes_When_Lowercased
    CWT   ; Changes_When_Titlecased
    CWU   ; Changes_When_Uppercased

So you could do any of a bunch of things:

    s/(\p{Cased}+)/\u\L$1/g;
    s/(\p{CWT}\p{CWL}+)/\u\L$1/g;
    s/(\p{CWT})(\p{CWL}+)/\u$1\L$2/g;

In practice some \b boundaries might be a good idea there.

You really have quite a lot of flexibility when you have all the
Unicode properties available to you.  I don't know how you're going to
get the properties into Java.  You have a problem already at Level 1,
which doesn't require very many.  What you'll do when you get to the
rest, I don't quite know, but I think you will have to choose some sort
of prefix for the properties whose names you have already defined in a
way that conflicts with the Unicode definition.

Maybe a leading "U"?  Since underscores don't (well, aren't *supposed*
to) count, that could just be:

    \p{U_Space}
    \p{U_Alpha}
    \p{U_Lower}

etc.  There is a proposed revision to tr18 that outlines this path
toward compliance as a perfectly valid one.

    http://unicode.org/reports/tr18/proposed.html#Full_Properties

    RL2.7 Full Properties

    To meet this requirement, an implementation shall support all of
    the properties listed below that are in the supported version of
    Unicode, with values that match the Unicode definitions for that
    version.

    As in RL1.2 Properties, in order to meet requirement RL2.7, the
    implementation has to satisfy the Unicode definition of the
    properties for the supported version of Unicode, not other
    possible definitions. However, the names used for the properties
    might need to be different for compatibility.
    For example, if a regex engine already has "Alphabetic", for
    compatibility it may need a different name, such as
    "Unicode_Alphabetic" for the Unicode property.

    The list excludes contributed properties, obsolete and deprecated
    properties, and the Unicode 1 Name and Unicode Radical Stroke
    properties. The properties in gray are covered by RL1.2 Properties.

It seems to me that you might need this for RL1.2 as well, since you
have definitions for the POSIX properties that don't match what RL1.2a
says they should be.

In Perl, we split off the [:upper:] things from the \p{upper} things so
that we could be strictly POSIXy on the former but fully compliant with
tr18 on the latter.  In Java you don't have the former syntax available,
and your version of the latter syntax is "wrong".

This is just part of why my fixing up the j.l.Pattern docs will take
longer.  Mostly I want to fix the things it says about Perl that are
wrong.  Some of those are wrong because they're outdated, and some are
wrong because they were never true.

Do you think I should use 5.12 as the version of Perl compared against,
or should I use 5.14 (which is in late RC0) because it is the one that
uses Unicode 6.0 and so would match JDK7?

--tom