Mark Davis ☕ wrote
on Mon, 19 Sep 2011 14:41:49 PDT:
> I agree with the first part, disallowing the irregular code sequences.
Finding that Java allowed surrogates to sneak through in their UTF-8
streams like that was quite odd.
> As to the noncharacters, it would be a horrible mistake to di
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken
in a couple (actually, three) of ways.
* First, it allows for intermixed CESU-8 and UTF-8 even though you
specify UTF-8, when it should be throwing an exception on the CESU-8.
It also allows unpaired surrog
Just a quick followup. This bug with equalsIgnoreCase working only for
the BMP went undetected all the way up through Unicode 3.1. That's
when the Deseret script was introduced, which is a case-changing script
outside the BMP. That was more than 10 years ago now. Obviously no one
is screaming
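As a rough illustration of the Deseret case described above, here is a minimal sketch of a case-insensitive comparison that walks code points instead of 16-bit chars. The class and method names are hypothetical and are not part of the JDK; only the Character and String calls are real API. Whether String.equalsIgnoreCase itself handles this pair depends on the JDK release, so the sketch makes no claim about it.

```java
final class CaseFoldDemo {
    // Hypothetical helper (not a JDK method): compares two strings
    // case-insensitively, one code point at a time, so that cased
    // letters outside the BMP (such as Deseret) are handled too.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                // Simple (1:1) mappings only: uppercase first, then lowercase.
                int ua = Character.toUpperCase(ca);
                int ub = Character.toUpperCase(cb);
                if (ua != ub && Character.toLowerCase(ua) != Character.toLowerCase(ub)) {
                    return false;
                }
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // DESERET CAPITAL LETTER LONG I
        String lower = new String(Character.toChars(0x10428)); // DESERET SMALL LETTER LONG I
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower));
    }
}
```

This uses simple case mappings only, so it does not attempt full case folding (for example, it will not equate "ß" with "SS").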
I've discovered some errors in Java's case insensitive methods
for its String class. Its equalsIgnoreCase() is the most
obvious one that gets things wrong, but there are several
others as well.
There is inarguably at least one significant bug, and quite plausibly
several others as well. I've loo
rist
2 application/octet-stream 1560 the nftest(-v1).java program as octets
name="nftest-v1.java"
filename="nftest-v1.java"
3 text/plain 1560 the nftest(-v2).java program as plain text
nftest-v1.java
Description: the
>I have filed CR/RFE 7036910:
>j.l.Character.toLowerCaseCharArray/toTitleCaseCharArray for this request.
Thanks very much.
> The j.l.Character.toLowerCase/toUpperCase() suggests to use
> String.toLower/UpperCase() for case mapping, if you want 1:M mapping
> taken care. And if you trust the API:-
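To make the 1:1 versus 1:M distinction in the quoted text concrete, here is a small sketch using U+00DF (LATIN SMALL LETTER SHARP S), which has no single-character uppercase form. The class name is hypothetical; the Character and String calls are standard API.

```java
import java.util.Locale;

final class OneToManyDemo {
    // Character.toUpperCase can only return one char, so a 1:M mapping is lost.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // String.toUpperCase applies the full mapping and may change the length.
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upperChar('\u00DF') == '\u00DF');  // the char is returned unchanged
        System.out.println(upperString("stra\u00DFe"));       // the string grows by one char
    }
}
```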
Andy Heninger wrote:
>>> I actually had to do this because I have a dataset that has things like
>>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
>>> head-match with "undead" and "smor", respectively. There is no
>>> decomposition of "ð" that includes "d", nor any of "ø" that in
Sherman,
While I was fixing your docs for j.l.Character, I kept the Unicode
6.0 specs close at hand to make sure everything was up to date. That's
how I was able to discover that one could safely update this comment
that noted that 1:M uppercasings happen only in the BMP:
-// As of
Xueming, the docs look good.
On the name of the flag, I have no strong feelings one way or the other.
Perhaps between UNICODE_PROPERTIES and UNICODE_CLASSES, I would prefer
the second one. The first makes me think of the regular properties like
\p{Script=Greek} from RL1.2, not the compat proper
Sherman,
The comparison to Perl 5 in the Java Pattern class documentation needs
to be corrected. However, I would not recommend as long a laundry list
of missing features from either side as the following email might imply.
I'm just trying to be complete, but in doing so, it produces a list that
Mark Davis ☕ wrote
on Sat, 23 Apr 2011 09:09:55 PDT:
> The changes sound good.
They sure do, don't they? I'm quite happy about this. I think it is more
important to get this in the queue than that it (necessarily) be done for
JDK7. That said, having a good tr18 RL1 story for JDK7's Unico
Thanks, Sherman.
I am obviously very much in favor of this happening, and
preferably sooner rather than later.
There are other benefits that derive from this, too. It adds to
j.l.Character three important properties needed to complete the
requirements of RL1.2. Once one can get at Lowercase, Uppercase,
I hope you all know there is a lot of handwaving at the end of my last
posting. :) That's because it isn't actually implementable as things stand.
There's no current way to track what was a single grapheme before the regex
gets its hands on it if that regex engine is doing some sort of decompositi
Thanks, Mark.
I've been trying to think about what to say to it.
I'd like to know more about what is planned in the "canonical matching" area.
I do understand why reordering makes exact matching impossible. However,
I should think one of several sorts of loose matching might still be done.
Maybe that
Sherman,
The other code thing that I saw, but also of course did not fix given
where you are in the release cycle, was another of these mysterious
non-parallel things. You have a
String getName(int codePoint)
function (well, static method) which takes a code point (like U+0130) and
produ
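For readers who have not used it, a minimal sketch of the getName method mentioned above, called with the same U+0130 code point. The wrapper class is hypothetical; Character.getName(int) is the real Java 7 method.

```java
final class CharNameDemo {
    // Thin wrapper around the static method discussed in the message.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(nameOf(0x0130));
        System.out.println(nameOf('A'));
    }
}
```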
Sherman,
In the spirit of open source development and the whole Open JDK, I offer
all you hardworking folks this patch to j.l.Character's embedded javadoc.
(I also have some comments on the code, but those I'll send under
separate cover.)
I set out to fix nothing more than the "errors of commiss
I wonder whether anyone here has considered the Unicode matters currently
up for public review, and how they do or do not impact Java. Their closing
dates are coming up on us quickly, and several of them definitely bear
discussion:
http://www.unicode.org/review/
No. Title
I'm happy to see that
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
includes a lot of great new stuff for Java regular expressions. I'm
specifically excited about named captures via (?<name>...) and \k<name>,
and the new \x{XXX} escape to allow code points to be specified lo
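A short sketch of the two Java 7 regex features named above: named capturing groups with a named back-reference, and the \x{...} code point escape. The class and method names and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Java7RegexDemo {
    // Finds the first immediately repeated word using a named group
    // and a named back-reference; returns null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b").matcher(text);
        return m.find() ? m.group("word") : null;
    }

    // True if the input is exactly the one code point U+10400,
    // written in the pattern with the \x{...} escape.
    static boolean isDeseretLongI(String s) {
        return Pattern.matches("\\x{10400}", s);
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("it was the the best of times"));
        System.out.println(isDeseretLongI(new String(Character.toChars(0x10400))));
    }
}
```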
Sherman wrote:
> The difference is at
> test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
> test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
> You can have unpaired surrogate in Java String, but
> if you have a paired one you can't say I want them to
> be two separated "unpaired"
Sherman wrote about a really cool \x{} patch.
> I assume if I can have this \x{...} in (7), we all agree we are
> done with RL1.1?:-)
Absolutely!!
--tom
> Mark's LITERALS test is something like this
> String s = new StringBuilder().appendCodePoint(i).toString();
> String target = "a" + s + "b";
> Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
> In which it does not escape those meta characters. Some are
> si
Sherman wrote:
> Oh, I see the problem. Obviously I have been working on jdk7 too long
> and forgot the latest release is still 6:-( There is indeed a bug in
> the previous implementation which I fixed in 7 a long time ago (I
> mentioned this in one of the early emails but was not specific, my
> apo
Mark wrote:
> We are coming up to a quarterly Unicode Technical Committee meeting
> (starting Feb 7), so there is the opportunity to make requests / proposals
> about UTS18. In particular, if there are areas of the spec that are unclear
> or features that people would like to see added or changed,
Under the RL2.2 link of tr18, there appears to be an error:
C2. An implementation claiming conformance to Level 2 of this
specification shall satisfy C1, and meet the requirements
described in the following sections:
RL2.1 Canonical Equivalents
RL2.2 Extended G
On Monday, 24 January 2011 at 14:39:59 +0900,
Masayoshi Okutsu wrote
>>> Are you talking about unpaired surrogates or something else?
>> Yes, I am talking about unpaired surrogates.
> I believe each code unit of UTF-16 gets converted to its code point. So,
> an unpaired surrogate gets conver
Mark wrote:
> The Unicode Standard distinguishes between Unicode Strings (16-bit) and
> UTF-16. In the former, which is often the form used in programming
> languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated
> as if it were a reserved code point.
Ahah! "Unicode Strings (16
Sherman wrote:
> The CR# so far I have are
> 7014645: Support Perl style Unicode hex notation \x{...}
> 7014633: Support loose matching for both abbreviated and longer names of
> Unicode property
> 7014640: Add meta character for line ending '\R'
> It might take a couple days(?) for these CR# to
> The fact that these POSIX/ASCII only version properties/constructs
> have been there for years ("compatibility") and it appears that "most"
> developers are happy (habit, performance...) with them, I don't think
> we can and want to switch to the Unicode version, simply for
> conformance.
I agr
That concludes my discoveries, analysis, and remediations related
to j.u.r.Pattern's conformance with tr18's Level 1 requirements.
I would be interested in guidance toward how I can best help you
now that all that is done.
Would you all like some time to absorb and digest this set of
writings fr
Now I will discuss the more interesting of my two functions, the one that
handles charclass escapes such as those given in RL1.2a. The particular
Level 1 places where this code is relevant are RL1.2a's Annex C Compatibility
Properties, RL1.4 Simple Word Boundaries, and RL1.6 Line Boundaries.
I do
When I set about resolving the Unicode troubles in Java regular
expressions through rewriting them into something Java understood,
I found it convenient to divide that functionality into two different
rewriting functions, one to handle string escapes like \u and the
other to handle charclas
Sherman, referring to Java's ASCII-only senses of \w and \s,
and of \p{alpha} and \p{space}, wrote:
> (does Perl 5 work in this way as well?)
No, not for a very, very long time. For most of Perl's life,
charclass escapes like \w have always been Unicode aware.
However, it did take us some time
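To illustrate the ASCII-only sense of \w that Sherman refers to, here is a minimal sketch. It assumes the Pattern.UNICODE_CHARACTER_CLASS flag that later shipped in Java 7, which is not necessarily the state of the code under discussion in this thread. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WordCharDemo {
    // True if the whole input consists of \w characters under the given flags.
    static boolean isWord(String s, int flags) {
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "sm\u00F8rrebr\u00F8d"; // contains U+00F8, a non-ASCII letter
        System.out.println(isWord(word, 0));                                // ASCII-only \w
        System.out.println(isWord(word, Pattern.UNICODE_CHARACTER_CLASS));  // Unicode-aware \w
    }
}
```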
Sherman,
Since you're looking through my messages for potential RFEs,
I thought I would point out a pair of low-hanging fruit for you.
tr18 contains two distinct strong recommendations, both of which should
be quite easy to convert into RFEs. As recommendations, even strong
ones, they are of course
Sherman wrote:
> Introducing in the new perl style \x{...} as the hexadecimal notation
> appears to be a nice-to-have enhancement (I will file a RFE to put this
> request in record). But I don't think you can simply deny that the Java
> Unicode escape sequences for UTF16 is NOT A "mechanism"/notat
Sherman wrote:
> Regarding RL1.4.(1), the U+200C and U+2000 are obviously a bug that
> the Java regex failed to update the implementation to sync with the
> tr#18 update, it appears these two don't "exists" in RL1.4/v9,
> neither does RL1.2a, the compatibility properties.
> The words for 1.4(1)
Sherman wrote:
> Thanks for the detailed and excellent "reality check". While I'm still
> going through all the details it appears that the fact the current
> Java Unicode property data does not include the properties defined in
> PropList.txt (current implementation reads the property data only f
> Are you talking about unpaired surrogates or something else?
Yes, I am talking about unpaired surrogates.
--tom
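For anyone unsure what an unpaired surrogate looks like in practice, here is a minimal sketch. A Java String may hold one, and a strict UTF-8 encoder (one created with newEncoder, which reports errors by default) refuses to encode it. The class and method names are hypothetical; the java.nio.charset calls are standard API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {
    // True if the string contains a surrogate code unit that is not part of a pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate itself when it is unpaired.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    // True if a strict (error-reporting) UTF-8 encoder accepts the string.
    static boolean strictUtf8Accepts(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lone = "a\uD800b";        // high surrogate with no low surrogate after it
        String paired = "a\uD800\uDF3Cb"; // a valid surrogate pair
        System.out.println(hasUnpairedSurrogate(lone) + " " + strictUtf8Accepts(lone));
        System.out.println(hasUnpairedSurrogate(paired) + " " + strictUtf8Accepts(paired));
    }
}
```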
In the JDK7 Pattern documentation, it says:
This class is in conformance with Level 1 of Unicode
Technical Standard #18: Unicode Regular Expression
Guidelines, plus RL2.1 Canonical Equivalents.
But the very first thing in tr18's conformance section reads:
C0. An implementation cl
In this message I cover only those errors made in the final
section ("Comparison to Perl 5") of:
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
I really hope no one is offended by this. I don't mean to be
a nitpicker. Technical errors in the documentation should b
Here is a summary of my findings:
Compliance  Req Num  Description
   ???      RL1.1    Hex Notation
   no       RL1.2    Properties
   no       RL1.2a   Compatibility Properties
   yes      RL1.3    Subtraction and Intersection
   no
I am somewhat uncertain, but I believe that Java
*almost* meets this requirement.
1.7 Code Points
A fundamental requirement is that Unicode text be interpreted
semantically by code point, not code units.
RL1.7 Supplementary Code Points
To meet this requirement, an
Java meets this requirement, but only just barely.
RL1.6 Line Boundaries
To meet this requirement, if an implementation provides for
line-boundary testing, it shall recognize not only CRLF, LF,
CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).
The reason I say
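As a rough illustration of the RL1.6 line terminators quoted above, the following sketch checks whether ^ and $ in MULTILINE mode treat NEL and LS as line boundaries, and how UNIX_LINES changes that. The class and method names and the sample text are made up; the flags are standard java.util.regex.Pattern constants.

```java
import java.util.regex.Pattern;

final class LineBoundaryDemo {
    // True if "b" can be found as a complete line of the input under the given flags.
    static boolean hasLineB(String input, int flags) {
        return Pattern.compile("^b$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // "a", NEL (U+0085), "b", LS (U+2028), "c"
        String text = "a\u0085b\u2028c";
        System.out.println(hasLineB(text, Pattern.MULTILINE));
        // With UNIX_LINES only '\n' counts as a line terminator.
        System.out.println(hasLineB(text, Pattern.MULTILINE | Pattern.UNIX_LINES));
    }
}
```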
Java meets this requirement:
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for
case-insensitive matching, then it shall provide at least the
simple, default Unicode case-insensitive matching.
To meet this requirement, if an implement
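A minimal sketch of what case-insensitive matching looks like in Java for a non-ASCII letter: CASE_INSENSITIVE on its own covers only US-ASCII, and adding UNICODE_CASE extends it to Unicode letters. The class and method names and the sample words are hypothetical.

```java
import java.util.regex.Pattern;

final class LooseMatchDemo {
    // True if the whole input matches the literal word under the given flags.
    static boolean matchesWord(String word, String input, int flags) {
        return Pattern.compile(Pattern.quote(word), flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E9cole"; // begins with U+00E9
        String upper = "\u00C9COLE"; // begins with U+00C9
        System.out.println(matchesWord(lower, upper, Pattern.CASE_INSENSITIVE));
        System.out.println(matchesWord(lower, upper,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```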
Java does not meet this requirement. Specifically, it
does not offer a mechanism for stipulation #1 cited below:
RL1.4 Simple Word Boundaries
To meet this requirement, an implementation shall extend the
word boundary mechanism so that:
(1) The class of <word_character> includes all the A
RL1.3 Subtraction and Intersection
To meet this requirement, an implementation shall supply
mechanisms for union, intersection and set-difference of
Unicode sets.
Java meets this requirement. However, because RL1.2 is not met,
it is of limited practical usefulness.
This is
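For reference, a small sketch of the three RL1.3 set operations in Java's character-class syntax: union by nesting, intersection with &&, and set difference as intersection with a negated class. The class and method names and the sample classes are made up for illustration.

```java
import java.util.regex.Pattern;

final class SetOpsDemo {
    // True if every character of the input belongs to the given character class.
    static boolean allIn(String charClass, String input) {
        return Pattern.matches(charClass + "+", input);
    }

    public static void main(String[] args) {
        System.out.println(allIn("[a-c[x-z]]", "abxz"));    // union: a-c or x-z
        System.out.println(allIn("[a-z&&[def]]", "fed"));   // intersection: d, e, or f
        System.out.println(allIn("[a-z&&[^m-p]]", "n"));    // difference: a-z except m-p
    }
}
```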
RL1.2a Compatibility Properties
To meet this requirement, an implementation shall provide the
properties listed in Annex C. Compatibility Properties, with the
property values as listed there. Such an implementation shall
document whether it is using the Standard Recommenda
This message explains precisely how Java fails to provide
any way to access these four required properties from RL1.2:
Alphabetic
Lowercase
Uppercase
Whitespace
Since Java does not provide them *by any name*, and RL1.2 specifically
includes those four in its "To meet this requirem
Sherman wrote:
> the \p{Lower/Upper/Alpha/Space} are specified/implemented for POSIX
> version, which is clearly documented in the API document.
I don't see how you can use Unicode names and give them non-Unicode
meanings. That doesn't seem fair.
Perl had the same problem for a long time. We f
Sherman wrote:
> The Unicode/java version of lowercase, uppercase, whitespace
> and letter character classes are provided via \p{javaXYZ},
I'm afraid that is *not* true; please see part 2.
> and the \p{Lower/Upper/Alpha/Space} are specified/implemented
> for POSIX version, which is clearly docum
Java does not meet the requirement of RL1.2. It provides only 3 of the 11
required properties; 4 it omits altogether, while 4 others it implements in
a fashion contrary to the standard. Java also neglects the strongly
recommended aspects of this section, which is quite a pity.
From tr18:
RL
Sherman wrote:
> As of the Unicode support in j.l.Character class,
>> What I would most dearly love to see is for Java to be brought fully up to date
>> so that its basic Character class supports whatever the current Unicode
>> release happens to be. Wouldn't that be great?
> Java language specification
Sherman,
In part 1, I outlined my thinking of why having to make end-users think
about representation issues in regexes goes against if not perhaps the law,
certainly to my mind the spirit of UTS(tr)#18 when it says that a compliant
"the regular expression engine provides support for Unicode character
Sherman,
Thank you so much for going out of your way to get your message
to me, all despite my broken mailer. Thanks to your help, I think
I have finally managed to wrestle it into working right. But that's
what I said last time, too, so we shall see.
> Introducing in the new perl style \x{...}
I've just cleared still more dynamic blacklist entries for Oracle's MX
servers, including rcsinet11.oracle.com [148.87.113.123]. If someone
from within Oracle could please send me mail, I'd like to test that
the way here is truly cleared again.
This is happening because you have a compromised ma
Here's the first requirement that must be met to claim Level 1 compliance.
Java does not yet meet this requirement, but it could easily do so: indeed,
my own regex-rewriting library implements this requirement. It takes very
little code at all and is *completely* backwards compatible because it u
> Thanks for looking into the Unicode support issues in Java RegEx.
> Since you have been working on Unicode in the past decade, I'm sure
> you understand that most of the issues you are pointing out here
> belong to the "Extended Unicode Support: Level 2" as documented in
> UTS#18 Unicode Regula
Sherman wrote:
> So even if certain Unicode Properties are not yet supported by
> Java RegEx, it does not mean they are not supported by the
> platform, you should be able to access those Unicode properties
> via java.lang.Character class.
Sherman, you're 100% right about that.
One case in point t
Sherman wrote:
> I would also like to point out that Java is NOT a RegEx based
> language/platform, RegEx is not part of the Java language (I
> mean the language specification), it is one of the utility
> packages in Java platform's core libraries.
I *do* understand what sorts of compromises ar
Sherman wrote:
> At the end, Java RegEx is NOT a Unicode RegEx, while it
> supports Unicode RegEx at certain level, sometimes via different
> syntax, I don't feel this is a big problem for most Java
> developers and should not be a stopper for most programs.
I do not understand what you mean when y
Sherman,
Thank you very, very much for your mail to me back on:
From: Xueming Shen
Date: Sun, Dec 12, 2010 at 7:01 AM
Subject: Re: Java and Unicode
To: Tom Christiansen
Cc: i18n-dev@openjdk.java.net
It was quite some time, however, before I saw it. That's becau
Good morning,
I'm Tom Christiansen; some of you may know me from my work in the Perl
Community. I'm here at the urging of Martijn Verburg, who thought that my
recent discoveries should be heard by your group.
I've been professionally programming for more than 25 years now, mostl