Sherman wrote:

> So even certain Unicode Properties are not yet supported by
> Java RegEx, it does not means they are not supported by the
> platform, you should be able to access those Unicode properties
> via java.lang.Character class.

Sherman, you're 100% right about that. One case in point is the
bidirectional European number separator property of characters.
That property is available in Java via

    Character.getDirectionality(int codePoint)
        == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR

That property is not available for use in regexes, meaning that you
can use neither the long form \p{Bidi_Class=European_Separator} nor
the short form \p{Bc=ES} within your patterns.

This is not necessarily a show-stopper, although it does constrain the
ways you approach these problems: you cannot use regular expressions
for them and have to test characters through the Character class
instead. That is not always a big deal, though.

With respect to the standard Java class Character alone -- without
regard to regular expressions at all -- please compare the Unicode
functionality provided by

    http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html

with that provided by the (soon to be) standard

    http://download.java.net/jdk7/docs/api/java/lang/Character.html

Because the ICU library has post-3.0 Unicode support not found in Java
proper, it is especially worth looking at closely. You "just" have to
use their UCharacter class to get it, not the standard Character class.

I may be wrong about all this--I really wish I were!--but looking over
the many significant improvements in ICU's UCharacter class over the
standard Java Character class, it really and truly looks to me like
Java last fully considered Unicode way, way back at the UCD 3.0 release
in the year 2000. That is a *very* long time ago in so-called
"Internet generations"!

I do not mean to give any offence in saying this: it's just what the
situation seems to be. Look it over and see whether you don't come to
the same conclusion. As I said, I wish I were wrong.

I can point out specific differences if you would like, but I think
folks familiar with the problem-space will spot them on their own
readily enough.
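
To make the European number separator case above concrete, here is a
rough sketch of doing by hand what a find() loop over \p{Bc=ES} would
have done. The class BidiSeparators and its method names are made up
for illustration; only Character.getDirectionality, Character.codePointAt,
Character.charCount and the DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR
constant come from the JDK. Which characters are reported depends on the
Unicode data bundled with the JVM you run it on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the JDK or ICU.
class BidiSeparators {

    // True if the code point has Bidi_Class=ES according to the
    // Unicode data bundled with the running JVM.
    static boolean isEuropeanSeparator(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR;
    }

    // Returns the char offsets of every ES code point in the text.
    // Iterates by code point, so supplementary characters are handled.
    static List<Integer> findEuropeanSeparators(CharSequence text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (isEuropeanSeparator(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Prints the offsets found in a small sample string.
        System.out.println(findEuropeanSeparators("12+34-56"));
    }
}
```

The same loop works for any property that Character exposes but the
regex engine does not; only the test inside isEuropeanSeparator changes.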

What I would most dearly love to see is Java brought fully up to date,
so that its basic Character class supports whatever the current Unicode
release happens to be. Wouldn't that be great?

I do understand that this is much too much work to be done by one
person alone, or in a short timespan: I certainly don't think it should
be rushed. I believe it should be a *goal*, albeit in my humble opinion
an important goal.

Time is marching on, and it will be easier to catch up to future
Unicode releases once Java catches up to whatever the current Unicode
release is. That is, I understand that Unicode 3.0 -> 6.0 is a big
jump, one requiring quite a bit of real work. But once that happens,
something like Unicode 6.0 -> 6.1 should be much easier.

--tom

PS: I'm trying to keep these messages to under 100 lines each.