Re: Java encoder errors

2011-09-19 Thread Tom Christiansen
Mark Davis ☕ wrote on Mon, 19 Sep 2011 14:41:49 PDT: > I agree with the first part, disallowing the irregular code sequences. Finding that Java allowed surrogates to sneak through in their UTF-8 streams like that was quite odd. > As to the noncharacters, it would be a horrible mistake to di

Java encoder errors

2011-09-19 Thread Tom Christiansen
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken in a couple (actually, three) of ways. * First, it allows for intermixed CESU-8 and UTF-8 even though you specify UTF-8, when it should be throwing an exception on the CESU-8. It also allows unpaired surrog
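
For readers who want to see what a strict UTF-8 encoding step looks like, here is a minimal sketch using `java.nio.charset.CharsetEncoder` with both error actions set to `REPORT`. The class and method names (`StrictUtf8`, `encode`, `isEncodable`) are made up for illustration. It assumes a current JDK, and its behavior there is not necessarily that of the 2011 encoder described in this message.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class StrictUtf8 {
    /** Encodes s as UTF-8 and throws on ill-formed input instead of substituting. */
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    /** True if s can be encoded strictly, that is, it holds no unpaired surrogate. */
    static boolean isEncodable(String s) {
        try {
            encode(s);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

By contrast, the convenience call `String.getBytes(StandardCharsets.UTF_8)` substitutes a replacement byte for input it cannot encode rather than throwing, so an explicit encoder is the way to get an exception.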

Re: Errors in Java casing

2011-08-12 Thread Tom Christiansen
Just a quick followup. This bug with equalsIgnoreCase working only within the BMP went undetected all the way up through Unicode 3.1. That's when the Deseret script was introduced, which is a case-changing script outside the BMP. That was more than 10 years ago now. Obviously no one is screaming
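
A case-insensitive comparison that does not depend on how a particular JDK release treats supplementary characters can be sketched by walking both strings one code point at a time. The class `CodePointCase` below is hypothetical. It uses only simple (1:1) case mappings from `java.lang.Character`, so it is an illustration of the idea rather than a full Unicode case-folding implementation.

```java
// Hypothetical helper, not part of the JDK.
final class CodePointCase {
    /** Compares two strings ignoring case, one code point (not one char) at a time. */
    static boolean equalsIgnoreCase(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                int ua = Character.toUpperCase(ca);
                int ub = Character.toUpperCase(cb);
                // Check both directions, as String.equalsIgnoreCase does for chars.
                if (ua != ub && Character.toLowerCase(ua) != Character.toLowerCase(ub)) {
                    return false;
                }
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }
}
```

With this helper, DESERET CAPITAL LETTER LONG I (U+10400) and DESERET SMALL LETTER LONG I (U+10428) compare as equal, because `Character.toUpperCase(int)` and `Character.toLowerCase(int)` accept supplementary code points.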

Errors in Java casing

2011-08-10 Thread Tom Christiansen
I've discovered some errors in Java's case insensitive methods for its String class. Its equalsIgnoreCase() is the most obvious one that gets things wrong, but there are several others as well. There is inarguably at least one significant bug, and quite plausibly several others as well. I've loo

Is(n't) this a Java Unicode compiler bug? [4=OSCON]

2011-07-19 Thread Tom Christiansen
[Attachments: (2) application/octet-stream, 1560: the nftest(-v1).java program as octets, filename="nftest-v1.java"; (3) text/plain, 1560: the nftest(-v2).java program as plain text]

Re: java.lang.Character lacuna #1 of 2

2011-04-26 Thread Tom Christiansen
>I have filed CR/RFE 7036910: >j.l.Character.toLowerCaseCharArray/toTitleCaseCharArray for this request. Thanks very much. > The j.l.Character.toLowerCase/toUpperCase() suggests to use > String.toLower/UpperCase() for case mapping, if you want 1:M mapping > taken care. And if you trust the API:-

Re: Proposed update to UTS#18

2011-04-26 Thread Tom Christiansen
Andy Heninger wrote: >>> I actually had to do this because I have a dataset that has things like >>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to >>> head-match with "undead" and "smor", respectively. There is no >>> decomposition of "ð" that includes "d", nor any of "ø" that in

java.lang.Character lacuna #1 of 2

2011-04-26 Thread Tom Christiansen
Sherman, While I was fixing your docs for j.l.Character, I kept the Unicode 6.0 specs close at hand to make sure everything was up to date. That's how I was able to discover that one could safely update this comment that noted that 1:M uppercasings happen only in the BMP: -// As of

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-24 Thread Tom Christiansen
Xueming, the docs look good. On the name of the flag, I have no strong feelings one way or the other. Perhaps between UNICODE_PROPERTIES and UNICODE_CLASSES, I would prefer the second one. The first makes me think of the regular properties like \p{Script=Greek} from RL1.2, not the compat proper

Suggested Perl-related updates for Pattern doc

2011-04-23 Thread Tom Christiansen
Sherman, The comparison to Perl 5 in the Java Pattern class documentation needs to be corrected. However, I would not recommend as long a laundry list of missing features from either side as the following email might imply. I'm just trying to be complete, but in doing so, it produces a list that

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Tom Christiansen
Mark Davis ☕ wrote on Sat, 23 Apr 2011 09:09:55 PDT: > The changes sound good. They sure do, don't they? I'm quite happy about this. I think it is more important to get this in the queue than that it (necessarily) be done for JDK7. That said, having a good tr18 RL1 story for JDK7's Unico

Re: Review request: 7037261: j.l.Character.isLowerCase/isUpperCase need to match the Unicode Standard definition

2011-04-19 Thread Tom Christiansen
Thanks, Sherman. I am obviously very much in favor of this happening, and preferably sooner not later. There are other benefits that derive from this, too. It adds to j.l.Character three important properties needed to complete the requirements of RL1.2. Once one can get at Lowercase, Uppercase,

Re: Proposed update to UTS#18

2011-04-15 Thread Tom Christiansen
I hope you all know there is a lot of handwaving at the end of my last posting. :) That's because it isn't actually implementable as things stand. There's no current way to track what was a single grapheme before the regex gets its hands on it if that regex engine is doing some sort of decompositi

Re: Proposed update to UTS#18

2011-04-14 Thread Tom Christiansen
Thanks, Mark. I've been trying to think about what to say to it. I'd like to know more about what is planned in the "canonical matching" area. I do understand why reordering makes exact matching impossible. However, I should think one of several sorts of loose matching might still be done. Maybe that

java.lang.Character lacuna #2 of 2

2011-04-14 Thread Tom Christiansen
Sherman, The other code thing that I saw, but also of course did not fix given where you are in the release cycle, was another of these mysterious non-parallel things. You have a String getName(int codePoint) function (well, static method) which takes a code point (like U+0130) and produ

DOC PATCH: java.lang.Character fixes (doc only, not code)

2011-04-14 Thread Tom Christiansen
Sherman, In the spirit of open source development and the whole Open JDK, I offer all you hardworking folks this patch to j.l.Character's embedded javadoc. (I also have some comments on the code, but those I'll send under separate cover.) I set out to fix nothing more than the "errors of commiss

Unicode Public Review Issues: the clock is ticking

2011-04-13 Thread Tom Christiansen
I wonder whether anyone here has considered the Unicode matters currently up for public review, and how they do or do not impact Java. Their closing dates are coming up on us quickly, and several of them definitely bear discussion: http://www.unicode.org/review/ No. Title

JDK7 and Unicode regular expressions

2011-04-13 Thread Tom Christiansen
I'm happy to see that http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html includes a lot of great new stuff for Java regular expressions. I'm specifically excited about named captures via (?<name>...) and \k<name>, and the new \x{XXX} escape to allow code points to be specified lo
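
The JDK 7 features mentioned in this message, named capturing groups, named backreferences, and the brace form of the hex escape, can be tried with a short sketch like the following. The class and method names are illustrative only; the pattern syntax itself is what the `java.util.regex.Pattern` documentation describes for JDK 7 and later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK.
final class Jdk7RegexDemo {
    /** Returns the text of the named group "year", or null when the input does not match. */
    static String year(String date) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(date);
        return m.matches() ? m.group("year") : null;
    }

    /** True if s is one character repeated twice, using a named backreference. */
    static boolean isDoubled(String s) {
        return Pattern.matches("(?<ch>.)\\k<ch>", s);
    }

    /** True if s is exactly the code point U+10400, written with the brace hex escape. */
    static boolean isDeseretLongI(String s) {
        return Pattern.matches("\\x{10400}", s);
    }
}
```

The brace escape names the code point directly, so the pattern author does not have to spell out a surrogate pair.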

Re: RL1.1 Hex Notation

2011-01-27 Thread Tom Christiansen
Sherman wrote: > The difference is at > test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$"); > test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$"); > You can have unpaired surrogate in Java String, but > if you have a paired one you can't say I want them to > be two separated "unpaired"

Re: RL1.1 Hex Notation

2011-01-27 Thread Tom Christiansen
Sherman wrote about a really cool \x{} patch. > I assume if I can have this \x{...} in (7), we all agree we are > done with RL1.1?:-) Absolutely!! --tom

Re: RL1.1 Hex Notation

2011-01-27 Thread Tom Christiansen
> Mark's LITERALS test is something like this > String s = new StringBuilder().appendCodePoint(i).toString(); > String target = "a" + s + "b"; > Failures.LITERALS.checkMatch(i, "a" + s + "b", target); > In which it does not escape those meta characters. Some are > si

Re: RL1.1 Hex Notation

2011-01-27 Thread Tom Christiansen
Sherman wrote: > Oh, I see the problem. Obviously I have been working on jdk7 too long > and forgot the latest release is still 6:-( There is indeed a bug in > the previous implementation which I fixed in 7 long time ago (I > mentioned this in one of the early emails but was not specific, my > apo

Re: UTS18 clarifications

2011-01-26 Thread Tom Christiansen
Mark wrote: > We are coming up to a quarterly Unicode Technical Committee meeting > (starting Feb 7), so there is the opportunity to make requests / proposals > about UTS18. In particular, if there are areas of the spec that are unclear > or features that people would like to see added or changed,

Possible error in tr18?

2011-01-26 Thread Tom Christiansen
Under the RL2.2 link of tr18, there appears to be an error: C2. An implementation claiming conformance to Level 2 of this specification shall satisfy C1, and meet the requirements described in the following sections: RL2.1 Canonical Equivalents RL2.2 Extended G

Re: RL1.7 Code Points

2011-01-26 Thread Tom Christiansen
On Monday, 24 January 2011 at 14:39:59 +0900, Masayoshi Okutsu wrote >>> Are you talking about unpaired surrogates or something else? >> Yes, I am talking about unpaired surrogates. > I believe each code unit of UTF-16 gets converted to its code point. So, > an unpaired surrogate gets conver

Re: RL1.1 Hex Notation (part 2 of 3)

2011-01-26 Thread Tom Christiansen
Mark wrote: > The Unicode Standard distinguishes between Unicode Strings (16-bit) and > UTF-16. In the former, which is often the form used in programming > languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated > as if it were a reserved code point. Ahah! "Unicode Strings (16

Re: Now what?

2011-01-26 Thread Tom Christiansen
Sherman wrote: > The CR# so far I have are > 7014645: Support Perl style Unicode hex notation \x{...} > 7014633: Support loose matching for both abbreviated and longer names of > Unicode property > 7014640: Add meta character for line ending '\R' > It might take a couple days(?) for these CR# to

Re: regex rewriting code (part 1 of 3)

2011-01-25 Thread Tom Christiansen
> The fact that these POSIX/ASCII only version properties/constructs > have been there for years ("compatibility") and it appears that "most" > developers are happy (habit, performance...) with them, I don't think > we can and want to switch to the Unicode version, simply for > conformance. I agr

Now what?

2011-01-25 Thread Tom Christiansen
That concludes my discoveries, analysis, and remediations related to j.u.r.Pattern's conformance with tr18's Level 1 requirements. I would be interested in guidance toward how I can best help you now that all that is done. Would you all like some time to absorb and digest this set of writings fr

regex rewriting code (part 3 of 3)

2011-01-25 Thread Tom Christiansen
Now I will discuss the more interesting of my two functions, the one that handles charclass escapes such as those given in RL1.2a. The particular Level 1 places where this code is relevant are RL1.2a's Annex C Compatibility Properties, RL1.4 Simple Word Boundaries, and RL1.6 Line Boundaries. I do

regex rewriting code (part 2 of 3)

2011-01-25 Thread Tom Christiansen
When I set about resolving the Unicode troubles in Java regular expressions through rewriting them into something Java understood, I found it convenient to divide that functionality into two different rewriting functions, one to handle string escapes like \u and the other to handle charclas

regex rewriting code (part 1 of 3)

2011-01-25 Thread Tom Christiansen
Sherman, referring to Java's ASCII-only senses of \w and \s, and of \p{alpha} and \p{space}, wrote: > (does Perl 5 work in this way as well?) No, not for a very, very long time. For most of Perl's life, charclass escapes like \w have been Unicode aware. However, it did take us some time

Converting tr18 "strong recommendations" into RFEs

2011-01-25 Thread Tom Christiansen
Sherman, Since you're looking through my messages for potential RFEs, I thought I would point out a pair of low-hanging fruits for you. tr18 contains two distinct strong recommendations, both of which should be quite easy to convert into RFEs. As recommendations, even strong ones, they are of course

Re: RL1.1 Hex Notation

2011-01-24 Thread Tom Christiansen
Sherman wrote: > Introducing in the new perl style \x{...} as the hexadecimal notation > appears to be a nice-to-have enhancement (I will file a RFE to put this > request in record). But I don't think you can simply deny that the Java > Unicode escape sequences for UTF16 is NOT A "mechanism"/notat

Re: RL1.4 Simple Word Boundaries

2011-01-24 Thread Tom Christiansen
Sherman wrote: > Regarding RL1.4.(1), the U+200C and U+2000 are obviously a bug that > the Java regex failed to update the implementation to sync with the > tr#18 update, it appears these two don't "exists" in RL1.4/v9, > neither does RL1.2a, the compatibility properties. > The words for 1.4(1)

Re: RL1.4 Simple Word Boundaries (actually, RL1.2 & RL1.2a)

2011-01-24 Thread Tom Christiansen
Sherman wrote: > Thanks for the detailed and excellent "reality check". While I'm still > going through all the details it appears that the fact the current > Java Unicode property data does not include the properties defined in > PropList.txt (current implementation reads the property data only f

Re: RL1.7 Code Points

2011-01-23 Thread Tom Christiansen
> Are you talking about unpaired surrogates or something else? Yes, I am talking about unpaired surrogates. --tom

Suggested corrections to the Level 1 conformance statement

2011-01-23 Thread Tom Christiansen
In the JDK7 Pattern documentation, it says: This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents. But the very first thing in tr18's conformance section reads: C0. An implementation cl

j.u.r.Pattern documentation errors

2011-01-23 Thread Tom Christiansen
In this message I cover only those errors made in the final section ("Comparison to Perl 5") of: http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html I really hope no one is offended by this. I don't mean to be a nitpicker. Technical errors in the documentation should b

Summary of tr18 Level 1 compliance findings

2011-01-23 Thread Tom Christiansen
Here is a summary of my findings:

Compliance  Req Num  Description
???         RL1.1    Hex Notation
no          RL1.2    Properties
no          RL1.2a   Compatibility Properties
yes         RL1.3    Subtraction and Intersection
no

RL1.7 Code Points

2011-01-23 Thread Tom Christiansen
I am somewhat uncertain, but I believe that Java *almost* meets this requirement. 1.7 Code Points A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units. RL1.7 Supplementary Code Points To meet this requirement, an

RL1.6 Line Boundaries

2011-01-23 Thread Tom Christiansen
Java meets this requirement, but only just barely. RL1.6 Line Boundaries To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028). The reason I say
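
The set of line terminators that RL1.6 lists (CRLF, LF, CR, NEL, LS, PS) can be written out as an explicit pattern. The sketch below is only an illustration with a made-up class name. It uses an explicit alternation rather than any built-in shorthand, so it does not depend on which JDK release is in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class LineBoundaries {
    // CRLF comes first so that it is consumed as a single terminator;
    // the character class then covers LF, CR, NEL, LS and PS.
    private static final Pattern LINE_END =
            Pattern.compile("\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]");

    /** Splits text on every RL1.6 line terminator, keeping trailing empty lines. */
    static String[] lines(String text) {
        return LINE_END.split(text, -1);
    }
}
```

Putting `\r\n` ahead of the character class matters: with the order reversed, a CRLF pair would be treated as two terminators and produce a spurious empty line.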

RL1.5 Simple Loose Matches

2011-01-23 Thread Tom Christiansen
Java meets this requirement: RL1.5 Simple Loose Matches To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching. To meet this requirement, if an implement

RL1.4 Simple Word Boundaries

2011-01-23 Thread Tom Christiansen
Java does not meet this requirement. Specifically, it does not offer a mechanism for stipulation #1 cited below: RL1.4 Simple Word Boundaries To meet this requirement, an implementation shall extend the word boundary mechanism so that: (1) The class of <word_character> includes all the A
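
The difference between an ASCII-only and a Unicode-aware `\w` can be observed directly. The following sketch assumes JDK 7 or later, where the flag `Pattern.UNICODE_CHARACTER_CLASS` is available; that flag postdates this message, and the class and method names here are illustrative only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK.
final class WordChars {
    /** Tests s against \w+ with default flags, under which \w is ASCII-only. */
    static boolean isWordDefault(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    /** Tests s against \w+ with Unicode-aware character classes enabled. */
    static boolean isWordUnicode(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s)
                .matches();
    }
}
```

A word such as "élan" fails the default test because of its first letter, and passes once the flag is set.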

RL1.3 Subtraction and Intersection

2011-01-23 Thread Tom Christiansen
RL1.3 Subtraction and Intersection To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of Unicode sets. Java meets this requirement. However, because RL1.2 is not met, it is of limited practical usefulness. This is

RL1.2 Compatibility Properties

2011-01-23 Thread Tom Christiansen
RL1.2a Compatibility Properties To meet this requirement, an implementation shall provide the properties listed in Annex C. Compatibility Properties, with the property values as listed there. Such an implementation shall document whether it is using the Standard Recommenda

RL1.2 Properties (part 2 of 2)

2011-01-23 Thread Tom Christiansen
This message explains precisely how Java fails to provide any way to access these four required properties from RL1.2: Alphabetic Lowercase Uppercase Whitespace Since Java does not provide them *by any name*, and RL1.2 specifically includes those four in its "To meet this requirem
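
For comparison, later JDK releases (7 and up) document binary properties in `Pattern` under an `Is` prefix, for example `\p{IsAlphabetic}`. The sketch below assumes such a release and therefore does not reflect the state of Java that this message describes. The helper class and its `has` method are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class BinaryProps {
    /**
     * True if the code point has the named binary property, for example
     * "Alphabetic", "Lowercase", "Uppercase" or "White_Space".
     */
    static boolean has(String property, int codePoint) {
        String ch = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{Is" + property + "}", ch);
    }
}
```

For instance, `BinaryProps.has("White_Space", 0x2003)` tests EM SPACE, and `BinaryProps.has("Lowercase", 0x00F8)` tests the letter "ø".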

Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Tom Christiansen
Sherman wrote: > the \p{Lower/Upper/Alpha/Space} are specified/implemented for POSIX > version, which is clearly documented in the API document. I don't see how you can use Unicode names and give them non-Unicode meanings. That doesn't seem fair. Perl had the same problem for a long time. We f

Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Tom Christiansen
Sherman wrote: > The Unicode/java version of lowercase, uppercase, whitespace > and letter character classes are provided via \p{javaXYZ}, I'm afraid that is *not* true; please see part 2. > and the \p{Lower/Upper/Alpha/Space} are specified/implemented > for POSIX version, which is clearly docum

RL1.2 Properties (part 1 of 2)

2011-01-22 Thread Tom Christiansen
Java does not meet the requirement of RL1.2. It provides only 3 of the 11 required properties; 4 it omits altogether, while 4 others it implements in a fashion contrary to the standard. Java also neglects the strongly recommended aspects of this section, which is quite a pity. From tr18: RL

Re: RL1.1 Hex Notation (part 3 of 3)

2011-01-22 Thread Tom Christiansen
Sherman wrote: > As of the Unicode support in j.l.Character class, >> What I would most dearly love to see is for Java to be brought fully up to date >> so that its basic Character class supports whatever the current Unicode >> release happens to be. Wouldn't that be great? > Java language specification

Re: RL1.1 Hex Notation (part 2 of 3)

2011-01-22 Thread Tom Christiansen
Sherman, In part 1, I outlined my thinking of why having to make end-users think about representation issues in regexes goes against, if not perhaps the law, certainly to my mind the spirit of UTS(tr)#18 when it says that a compliant "the regular expression engine provides support for Unicode character

Re: RL1.1 Hex Notation (part 1 of 3)

2011-01-22 Thread Tom Christiansen
Sherman, Thank you so much for going out of your way to get your message to me, all despite my broken mailer. Thanks to your help, I think I have finally managed to wrestle it into working right. But that's what I said last time, too, so we shall see. > Introducing in the new perl style \x{...}

more Oracle MX troubles

2011-01-21 Thread Tom Christiansen
I've just cleared still more dynamic blacklist entries for Oracle's MX servers, including rcsinet11.oracle.com [148.87.113.123]. If someone from within Oracle could please send me mail, I'd like to test that the way here is truly cleared again. This is happening because you have a compromised ma

RL1.1 Hex Notation

2011-01-21 Thread Tom Christiansen
Here's the first requirement that must be met to claim Level 1 compliance. Java does not yet meet this requirement, but it could easily do so: indeed, my own regex-rewriting library implements this requirement. It takes very little code at all and is *completely* backwards compatible because it u

Level 1 Unicode support for Java regexes: overview

2011-01-21 Thread Tom Christiansen
> Thanks for looking into the Unicode support issues in Java RegEx. > Since you have been working on Unicode in the past decade, I'm sure > you understand that most of the issues you are pointing out here > belongs to the "Extended Unicode Support: Level 2" as documented in > UTS#18 Unicode Regula

Unicode support in class Character

2011-01-20 Thread Tom Christiansen
Sherman wrote: > So even certain Unicode Properties are not yet supported by > Java RegEx, it does not means they are not supported by the > platform, you should be able to access those Unicode properties > via java.lang.Character class. Sherman, you're 100% right about that. One case in point t

Java and regex-based languages

2011-01-20 Thread Tom Christiansen
Sherman wrote: > I would also like to point out that Java is NOT a RegEx based > language/platform, RegEx is not part of the Java language (I > means the language specification), it is one of the utility > packages in Java platform's core libraries. I *do* understand what sorts of compromises ar

Java Regexes vs Unicode Regexes

2011-01-20 Thread Tom Christiansen
Sherman wrote: > At the end, Java RegEx is NOT a Unicode RegEx, while it > supports Unicode RegEx at certain level, sometime via different > syntax, I don't feel this is a big problem for most Java > developers and should not be a stopper for most program. I do not understand what you mean when y

long-delayed response on Java and Unicode

2011-01-20 Thread Tom Christiansen
Sherman, Thank you very, very much for your mail to me back on: From: Xueming Shen Date: Sun, Dec 12, 2010 at 7:01 AM Subject: Re: Java and Unicode To: Tom Christiansen Cc: i18n-dev@openjdk.java.net It was quite some time, however, before I saw it. That's becau

Java and Unicode

2010-12-11 Thread Tom Christiansen
Good morning, I'm Tom Christiansen; some of you may know me from my work in the Perl Community. I'm here at the urging of Martijn Verburg, who thought that my recent discoveries should be heard by your group. I've been professionally programming for more than 25 years now, mostl