Hi Tom,

Thanks for looking into the Unicode support issues in Java RegEx.

Since you have been working on Unicode for the past decade, I'm sure you understand that most of the issues you are pointing out here belong to "Extended Unicode Support: Level 2" as documented in UTS#18, Unicode Regular Expressions [2]. Unfortunately the current Java RegEx implementation only supports "Basic Unicode Support: Level 1", as specified in the java.util.regex.Pattern API document [1].

I'm aware of and impressed by the Unicode support added "recently" in Perl 6, and was planning to close the gap in JDK7 (basically Java RegEx is the implementation that "matches" Perl 5). Due to resource issues I only managed to add the script and name support to RegEx and the Character class. I hope I can do more in the next couple of months; otherwise the rest will be deferred to JDK8 (\X is probably the most important item next on my list).
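As a rough illustration of the script and name support mentioned above, here is a minimal sketch. It assumes a JDK 7 or later runtime, where the \p{script=...} regex syntax, Character.UnicodeScript and Character.getName are available. The class and method names (ScriptSupportSketch, allInScript, nameOf) are hypothetical and only used for this example.

```java
public class ScriptSupportSketch {
    // True if every character of s belongs to the named Unicode script,
    // using the \p{script=Name} syntax of java.util.regex.Pattern.
    static boolean allInScript(String s, String script) {
        return s.matches("\\p{script=" + script + "}+");
    }

    // Unicode character name of a code point, via java.lang.Character.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // GREEK SMALL LETTER ALPHA
        int cp = alpha.codePointAt(0);
        System.out.println(allInScript(alpha, "Greek"));   // true
        System.out.println(allInScript(alpha, "Latin"));   // false
        System.out.println(Character.UnicodeScript.of(cp)); // GREEK
        System.out.println(nameOf(cp));                     // GREEK SMALL LETTER ALPHA
    }
}
```

The same script information is reachable both from a pattern and from the Character class, so code that does not go through RegEx can still query it.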

Regarding the POSIX properties: in Java RegEx, the Unicode Alphabetic, Lowercase, Uppercase or Whitespace properties are supported by using \p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} or \p{javaWhitespace}. The \p{Lower/Upper/ASCII/Alpha...} properties, as you noticed, are clearly specified by the Java RegEx specification [1] to be for US-ASCII only (does Perl 5 work this way as well?). This is by design, and I don't agree with the "this is a mess" conclusion. While there are developers out there who might like these properties to evolve into the Unicode properties, I am pretty sure there are just as many developers who would prefer that they be kept as the "original" POSIX properties. In the end, Java RegEx is NOT a Unicode RegEx; it supports Unicode RegEx at a certain level, sometimes via different syntax. I don't feel this is a big problem for most Java developers, and it should not be a stopper for most programs.
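To make the distinction concrete, here is a small sketch comparing the US-ASCII-only POSIX classes with the java* properties. It assumes patterns compiled with default flags. The class and method names (PosixVsJavaProperties, posixLower, and so on) are hypothetical.

```java
import java.util.regex.Pattern;

public class PosixVsJavaProperties {
    // POSIX classes: documented as US-ASCII only.
    static final Pattern POSIX_LOWER = Pattern.compile("\\p{Lower}");
    static final Pattern POSIX_SPACE = Pattern.compile("\\p{Space}");
    // java* properties: delegate to Character.isLowerCase / Character.isWhitespace.
    static final Pattern JAVA_LOWER = Pattern.compile("\\p{javaLowerCase}");
    static final Pattern JAVA_SPACE = Pattern.compile("\\p{javaWhitespace}");

    static boolean posixLower(String s) { return POSIX_LOWER.matcher(s).matches(); }
    static boolean javaLower(String s)  { return JAVA_LOWER.matcher(s).matches(); }
    static boolean posixSpace(String s) { return POSIX_SPACE.matcher(s).matches(); }
    static boolean javaSpace(String s)  { return JAVA_SPACE.matcher(s).matches(); }

    public static void main(String[] args) {
        String eAcute = "\u00e9";  // LATIN SMALL LETTER E WITH ACUTE
        String emSpace = "\u2003"; // EM SPACE
        System.out.println(posixLower("e"));     // true
        System.out.println(posixLower(eAcute));  // false: outside US-ASCII
        System.out.println(javaLower(eAcute));   // true
        System.out.println(posixSpace(emSpace)); // false: outside US-ASCII
        System.out.println(javaSpace(emSpace));  // true
    }
}
```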

I would also like to point out that Java is NOT a RegEx-based language/platform. RegEx is not part of the Java language (I mean the language specification); it is one of the utility packages in the Java platform's core libraries. So even if certain Unicode properties are not yet supported by Java RegEx, it does not mean they are not supported by the platform: you should be able to access those Unicode properties via the java.lang.Character class [4]. So I would strongly disagree with the comment that "Java’s Unicode property support is *strictly antemillennial*, by which I mean it supports no Unicode property that has come out in the last decade." :-) Even Java RegEx is NOT that bad: the script, block and category property support is pretty "up to date".
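For example, category and block information can be read either through java.lang.Character or through a pattern. The following is a minimal sketch under that assumption. The class and method names (CharacterPropertySketch, isLowercaseLetter, blockOf, matchesProperty) are hypothetical.

```java
public class CharacterPropertySketch {
    // General category check through java.lang.Character.
    static boolean isLowercaseLetter(int codePoint) {
        return Character.getType(codePoint) == Character.LOWERCASE_LETTER;
    }

    // Unicode block lookup through java.lang.Character.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // The same kind of property tested through a regex, e.g. "Ll" or "InGreek".
    static boolean matchesProperty(String s, String property) {
        return s.matches("\\p{" + property + "}+");
    }

    public static void main(String[] args) {
        int eAcute = 0x00e9; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(isLowercaseLetter(eAcute));            // true
        System.out.println(blockOf(eAcute));                      // LATIN_1_SUPPLEMENT
        System.out.println(matchesProperty("\u00e9", "Ll"));      // true  (category)
        System.out.println(matchesProperty("\u03b1", "InGreek")); // true  (block)
        System.out.println(matchesProperty("\u00e9", "InGreek")); // false
    }
}
```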

Anyway, as I said, we do have a "plan" to improve the Unicode support in Java RegEx and are adding more pieces to it, though it might be a little slower than people would like (currently I can only spend less than 5% of my time on RegEx for JDK7; I hope I can allocate more time in the next couple of months). The good news is that Java is now an open source project/platform, and I'm sure your decade of experience in Unicode and Perl would definitely help should you decide to contribute [5]. Even without direct code contribution, it would still benefit the Java community if you could spend some time listing all your concerns about the Unicode support in Java RegEx. I promise I will go through them one by one (I will look into [3] in more detail next week).

I believe most of the Java Unicode "experts" are on this mailing list, so we can start from here.


Thanks,
Sherman


[1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html
[2] http://www.unicode.org/reports/tr18
[3] http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
[4] http://download.java.net/jdk7/docs/api/java/lang/Character.html
[5] http://openjdk.java.net/contribute/



On 12/11/2010 09:38, Tom Christiansen wrote:
Good morning,

I'm Tom Christiansen; some of you may know me from my work in the Perl
Community.  I'm here at the urging of Martijn Verburg, who thought that my
recent discoveries should be heard by your group.

I've been professionally programming for more than 25 years now, mostly in
C and Perl.  I recently joined the biomedical text-mining group at the
University of Colorado, where the bulk of our code base is in Java.

I've been responsible for working with large text corpora entirely in
Unicode.  For example, one corpus comprises almost 200,000 papers and 11
gigabytes, while another is a single file of 6 gigabytes.  I'm not new to
Unicode, having worked with it a great deal over the last decade.

Although most of our code base is in Java, we also have a considerable
portion of Perl code and some Python code, too.  This code often first
tokenizes the input stream before moving on to more sophisticated semantic
processing.  I was quite surprised to learn how differently Java treated
Unicode text than how the same text is treated by Perl and Python, even
using identical regular expressions.  This has proved to be a significant
barrier to fully adopting Java for our Unicode work.

This prompted me to make a comprehensive study of Unicode issues in Java,
focusing on regular expressions but also exploring other areas.  I've
identified about two dozen individual areas that I feel deserve to be
looked at.  These range from mismatches between documentation and behavior,
to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
genuine bugs and international standards violations.

Taken as a whole, these problem areas make Java a very difficult choice for
the sort of text processing my group needs to use it for.  Surely many
others all around the world are in a similar position.

I've searched the archives for this mailing list, and have found no mention
of these troubles either there, or indeed anywhere at all on the web.  For
example:

     
http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8

I have working code that fixes what for us is the most egregious of these
problems: that regexes were unusable on Unicode.  One fundamental bug is
that Java has misunderstood the connection between \b and \w regexes, so
that now a string like "élève" is not matched by the pattern "\b\w+\b" at
any point in the string.
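A minimal sketch of one way to get Unicode-aware word matching without relying on \b and \w is shown below. It is a simplified illustration, not the code linked in this message. It assumes that "word character" means letters, combining marks, numbers and underscore, and the names UnicodeWordSketch, WORD and words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordSketch {
    // Assumed stand-in for \w: letters, marks, numbers, underscore.
    static final String WORD = "[\\pL\\pM\\pN_]";

    // Assumed stand-in for \b\w+\b: a run of word characters that is not
    // preceded or followed by another word character.
    static final Pattern WORDS =
        Pattern.compile("(?<!" + WORD + ")" + WORD + "+(?!" + WORD + ")");

    // Collect every Unicode-aware word found in the text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve"; // "élève" with precomposed accents
        System.out.println(words(eleve));
        System.out.println(words("un " + eleve + ", ici"));
        // Plain \b\w+\b for comparison; the result is printed rather than
        // stated here because it may differ between JDK versions.
        System.out.println(Pattern.compile("\\b\\w+\\b").matcher(eleve).find());
    }
}
```

Building the boundary from lookarounds over the same character class keeps the "word" and "boundary" definitions consistent with each other, which is the property the \b and \w pair is expected to have.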

Other very serious problems include Java's unjustifiable demotion of legal
Unicode whitespace characters from the set of whitespace characters
(breaking tokenization), using Unicode property names in ways contrary to
what the spec says they do, and in general supporting no Unicode properties
any later than 3.0: even the critical Unicode 3.1 properties are ignored by
Java.  These are very serious problems.  Java almost cannot be said to
support Unicode--at least any Unicode release from the last ten
years--until these critical deficiencies are fixed.

You can find a brief synopsis of these specific troubles as well as a link
to the Java code that fixes them here:

     
http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261

I don't by any means think this is the best way to go about this.  It's
just a band-aid we needed quickly to allow us to move on with our work.
I'd like to offer it as a starting point for discussion of the issues that
prompted its creation.

As I mentioned, I have a couple dozen different Java Unicode issues, and
this addresses just one or two of them.  When I get time, I'll try to bring
up the others here in separate threads.

If you could advise me how best to contribute to helping out here, I would
be grateful.

Thank you,

--tom
