Unicode support in class Character

Tom Christiansen Thu, 20 Jan 2011 13:19:03 -0800

Sherman wrote:

> So even certain Unicode Properties are not yet supported by
> Java RegEx, it does not means they are not supported by the
> platform, you should be able to access those Unicode properties
> via java.lang.Character class.


Sherman, you're 100% right about that.

One case in point is the bidirectional European number separator
property of characters.  That property is available in Java via

    Character.getDirectionality(int codePoint)
        == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR

That property is not available for use in regexes, meaning
that you can use neither the long form

    \p{Bidi_Class=European_Separator}

nor the short form

    \p{Bc=ES}

within your patterns.  This is not necessarily a show-stopper,
although it does constrain the ways you approach these problems:
you cannot use regular expressions to test for such properties.

That is not always a big deal, though.
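
For instance, here is a minimal sketch of testing for that property by
walking a string one code point at a time instead of using a pattern.
The class name BidiSeparators and both method names are made up for
illustration; only the java.lang.Character calls are real API.  Which
characters actually carry the property depends on the Unicode version
the running JVM's Character class was built against.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; only the java.lang.Character calls are standard API.
public class BidiSeparators {

    /** True if this JVM's Unicode tables give the code point Bidi_Class=ES. */
    public static boolean isEuropeanSeparator(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR;
    }

    /** Returns the char offsets in text where a European number separator starts. */
    public static List<Integer> findEuropeanSeparators(String text) {
        List<Integer> offsets = new ArrayList<>();
        // Step by code point, not by char, so supplementary characters stay intact.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isEuropeanSeparator(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(findEuropeanSeparators("3+4-5"));
    }
}
```

The loop does the job that \p{Bc=ES} would do inside a pattern: more
typing, but it needs nothing beyond the Character class.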

With respect to the standard Java class Character alone --
without regard to regular expressions at all -- please
compare the Unicode functionality provided by

    http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html

with that provided by the (soon to be) standard

    http://download.java.net/jdk7/docs/api/java/lang/Character.html

Because the ICU library has post-3.0 Unicode support not found in Java
proper, it is especially worth looking at closely.  You "just" have to use
their UCharacter class to get it, not the standard Character class.

I may be wrong about all this--I really wish I were!--but looking over
the many significant improvements in ICU's UCharacter class over the
standard Java Character class, it really and truly looks to me like
Java last fully considered Unicode way, way back at the UCD 3.0
release in the year 2000.  That is a *very* long time ago in so-called
"Internet generations"!

I do not mean to give any offence in saying this: it's just what the
situation seems to be.  Look it over and see whether you don't come to the
same conclusion.  As I said, I wish I were wrong.  I can point out specific
differences if you would like, but I think folks familiar with the
problem-space will spot them on their own readily enough.

What I would most dearly love to see is Java brought fully up to date
so that its basic Character class supports whatever the current Unicode
release happens to be.  Wouldn't that be great?

I do understand that this is much too much work to be done by one person
alone.  Or in a short timespan: I certainly don't think it should be
rushed.  I believe it should be a *goal*, albeit in my humble opinion an
important goal.  Time is marching on, and it will be easier to catch up 
to future Unicode releases once Java catches up to whatever the current
Unicode release is.

That is, I understand that Unicode 3.0 -> 6.0 is a big jump, one requiring
quite a bit of real work.  But once that happens, something like Unicode 
6.0 -> 6.1 should be much easier.

--tom

    PS: I'm trying to keep these messages to under 100 lines each.
