Re: Java and Unicode

Xueming Shen Sat, 11 Dec 2010 23:03:51 -0800

 Hi Tom,

Thanks for looking into the Unicode support issues in Java RegEx.

Since you have been working on Unicode for the past decade, I'm sure you
understand that most of the issues you are pointing out here belong to the
"Extended Unicode Support: Level 2" as documented in UTS#18 Unicode Regular
Expressions [2]. Unfortunately the current Java RegEx implementation only
supports the "Basic Unicode Support: Level 1", as specified in the Java RegEx
java.util.regex.Pattern API document [1].

I'm aware of, and impressed by, the Unicode support added "recently" in
Perl 6, and was planning to close the gap in JDK7 (basically, Java RegEx is
the implementation that "matches" Perl 5). Due to resource issues I only
managed to add the script and name support in RegEx and the Character class.
I hope I can do more in the next couple of months; otherwise the rest will be
deferred to JDK8 (\X is probably the most important one next on my list).

Regarding the POSIX properties: in Java RegEx the Unicode Alphabetic,
Lowercase, Uppercase or Whitespace properties are supported by using
\p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} or \p{javaWhitespace}.
The \p{Lower/Upper/ASCII/Alpha...} properties, as you noticed, are clearly
specified by the Java RegEx specification [1] to be for US-ASCII only (does
Perl 5 work this way as well?). This is by design, and I don't agree with the
"this is a mess" conclusion. While there are developers out there who might
like these properties to evolve into the Unicode properties, I am pretty sure
there are just as many developers who would prefer that these properties be
kept as the "original" POSIX properties. In the end, Java RegEx is NOT a
Unicode RegEx; while it supports Unicode RegEx at a certain level, sometimes
via different syntax, I don't feel this is a big problem for most Java
developers, and it should not be a stopper for most programs.
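
To make the distinction above concrete, here is a minimal sketch (the class
and method names are made up for illustration) that compares the POSIX-style
properties with the java* and general-category properties on a non-ASCII
lowercase letter. It assumes the pattern is compiled with default flags.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of this thread.
class PosixVsJavaProperties {
    // Returns true if the whole input matches the given regex (default flags).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        // POSIX-style properties are documented as US-ASCII only by default.
        System.out.println("\\p{Lower} on 'e':        " + matches("\\p{Lower}", "e"));
        System.out.println("\\p{Lower} on e-acute:    " + matches("\\p{Lower}", eAcute));
        System.out.println("\\p{Alpha} on e-acute:    " + matches("\\p{Alpha}", eAcute));

        // The java* properties delegate to java.lang.Character methods.
        System.out.println("\\p{javaLowerCase}:       " + matches("\\p{javaLowerCase}", eAcute));
        System.out.println("\\p{javaLetter}:          " + matches("\\p{javaLetter}", eAcute));

        // Unicode general category Ll (lowercase letter).
        System.out.println("\\p{Ll}:                  " + matches("\\p{Ll}", eAcute));
    }
}
```

With default flags, the two POSIX-style checks on the accented letter should
report false, while the java* and \p{Ll} checks report true.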

I would also like to point out that Java is NOT a RegEx-based
language/platform. RegEx is not part of the Java language (I mean the
language specification); it is one of the utility packages in the Java
platform's core libraries. So even if certain Unicode properties are not yet
supported by Java RegEx, it does not mean they are not supported by the
platform; you should be able to access those Unicode properties via the
java.lang.Character class [4]. So I would strongly disagree with the comment
that "Java’s Unicode property support is *strictly antemillennial*, by which
I mean it supports no Unicode property that has come out in the last
decade." :-) Even Java RegEx is NOT that bad: the script, block and category
property support is pretty "up to date".
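
As a rough illustration of querying properties through java.lang.Character
rather than through a regex, consider the sketch below. The class name and
the describe() helper are hypothetical; the Character methods it calls are
the standard ones documented in [4].

```java
// Hypothetical demo class; only the java.lang.Character calls are real API.
class CharacterPropertyLookup {
    // Summarizes a few properties of a code point without using any regex.
    static String describe(int codePoint) {
        return String.format("U+%04X letter=%b lower=%b whitespace=%b block=%s",
                codePoint,
                Character.isLetter(codePoint),
                Character.isLowerCase(codePoint),
                Character.isWhitespace(codePoint),
                Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E9)); // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(describe(0x03B1)); // GREEK SMALL LETTER ALPHA
        System.out.println(describe(0x2003)); // EM SPACE
    }
}
```

Which properties are available this way depends on the Unicode version the
running JDK implements, so results for recently added characters can differ
between releases.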

Anyway, as I said, we do have a "plan" to improve the Unicode regex support
in Java RegEx and are adding more pieces to it, though it might be a little
slower than people would like to see (currently I can only spend less than 5%
of my time on RegEx for JDK7; I hope I can allocate more time in the next
couple of months). The good news is that Java is now an open source
project/platform, and I'm sure your decade of experience in Unicode and Perl
would definitely help should you decide to contribute [5]. Even without
direct code contribution, it would still benefit the Java community if you
could spend some time listing all your concerns about the Unicode support in
Java RegEx. I promise I will go through them one by one (I will look into [3]
in more detail next week).

I believe most of the Java Unicode "experts" are on this mailing list, so we
can start from here.



Thanks,
Sherman


[1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html
[2] http://www.unicode.org/reports/tr18

[3] http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261

[4] http://download.java.net/jdk7/docs/api/java/lang/Character.html
[5] http://openjdk.java.net/contribute/



On 12/11/2010 09:38, Tom Christiansen wrote:

Good morning,

I'm Tom Christiansen; some of you may know me from my work in the Perl
Community. I'm here at the urging of Martijn Verburg, who thought that my
recent discoveries should be heard by your group.

I've been professionally programming for more than 25 years now, mostly in
C and Perl. I recently joined the biomedical text-mining group at the
University of Colorado, where the bulk of our code base is in Java.

I've been responsible for working with large text corpora entirely in
Unicode. For example, one corpus comprises almost 200,000 papers and 11
gigabytes, while another is a single file of 6 gigabytes. I'm not new to
Unicode, having worked with it a great deal over the last decade.

Although most of our code base is in Java, we also have a considerable
portion of Perl code and some Python code, too. This code often first
tokenizes the input stream before moving on to more sophisticated semantic
processing. I was quite surprised to learn how differently Java treated
Unicode text than how the same text is treated by Perl and Python, even
using identical regular expressions. This has proved to be a significant
barrier to fully adopting Java for our Unicode work.

This prompted me to make a comprehensive study of Unicode issues in Java,
focusing on regular expressions but also exploring other areas. I've
identified about two dozen individual areas that I feel deserve to be
looked at. These range from mismatches between documentation and behavior,
to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
genuine bugs and international standards violations.

Taken as a whole, these problem areas make Java a very difficult choice for
the sort of text processing my group needs to use it for. Surely many
others all around the world are in a similar position.

I've searched the archives for this mailing list, and have found no mention
of these troubles either there, or indeed anywhere at all on the web. For
example:

http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8

I have working code that fixes what for us is the most egregious of these
problems: that regexes were unusable on Unicode. One fundamental bug is
that Java has misunderstood the connection between \b and \w regexes, so
that now a string like "élève" is not matched by the pattern "\b\w+\b" at
any point in the string.
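
The \b and \w case described above can be explored with a sketch like the
following. Two things here are assumptions not taken from this thread: the
class and helper names are invented, and the Pattern.UNICODE_CHARACTER_CLASS
flag was only added to java.util.regex.Pattern in Java 7, after this message
was written. The result of the default-flag pattern depends on the JDK
version, so no output is claimed for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for experimenting with word matching.
class UnicodeWordMatch {
    // Collects every match of the pattern in the input, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String word = "\u00e9l\u00e8ve"; // the string "élève", written with escapes

        // Default flags: \w is ASCII only; how \b behaves varies by JDK release.
        System.out.println(findAll(Pattern.compile("\\b\\w+\\b"), word));

        // Java 7+ flag: \w and \b both follow Unicode definitions.
        System.out.println(findAll(
                Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), word));

        // Explicit categories (letters, numbers, underscore), no \b or \w.
        System.out.println(findAll(Pattern.compile("[\\p{L}\\p{N}_]+"), word));
    }
}
```

The second and third patterns should each find the whole word as a single
match.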

Other very serious problems include Java's unjustifiable demotion of legal
Unicode whitespace characters from the set of whitespace characters
(breaking tokenization), using Unicode property names in ways contrary to
what the spec says they do, and in general supporting no Unicode properties
any later than 3.0: even the critical Unicode 3.1 properties are ignored by
Java. These are very serious problems. Java almost cannot be said to
support Unicode--at least any Unicode release from the last ten
years--until these critical deficiencies are fixed.
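
One way to see the whitespace differences is to compare the default \s with
the java.lang.Character predicates and the Unicode Zs (space separator)
category. This is only a sketch with invented class and method names; it
checks two characters and does not attempt to reproduce the full list of
issues referred to above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing several notions of "whitespace".
class WhitespaceCheck {
    // True if the single-character string matches \s under default flags.
    static boolean regexSpace(String s) {
        return Pattern.matches("\\s", s);
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";    // NO-BREAK SPACE
        String emSpace = "\u2003"; // EM SPACE

        // Default \s covers only space, tab, newline, vertical tab,
        // form feed and carriage return.
        System.out.println("\\s nbsp:     " + regexSpace(nbsp));
        System.out.println("\\s em space: " + regexSpace(emSpace));

        // Character.isWhitespace excludes the non-breaking spaces.
        System.out.println("isWhitespace nbsp:     " + Character.isWhitespace(0x00A0));
        System.out.println("isWhitespace em space: " + Character.isWhitespace(0x2003));

        // Character.isSpaceChar and \p{Zs} follow the general category.
        System.out.println("isSpaceChar nbsp: " + Character.isSpaceChar(0x00A0));
        System.out.println("\\p{Zs} nbsp:      " + Pattern.matches("\\p{Zs}", nbsp));
    }
}
```

A tokenizer that splits on \s with default flags would therefore leave both
of these characters inside tokens.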

You can find a brief synopsis of these specific troubles as well as a link
to the Java code that fixes them here:

http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261

I don't by any means think this is the best way to go about this. It's
just a band-aid we needed quickly to allow us to move on with our work.
I'd like to offer it as a starting point for discussion of the issues that
prompted its creation.

As I mentioned, I have a couple dozen different Java Unicode issues, and
this addresses just one or two of them. When I get time, I'll try to bring
up the others here in separate threads.

If you could advise me how best to contribute to helping out here, I would
be grateful.

Thank you,

--tom
