Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Xueming Shen
Tom, The Unicode/Java versions of the lowercase, uppercase, whitespace and letter character classes are provided via \p{javaXYZ}, and the \p{Lower/Upper/Alpha/Space} are specified/implemented for the POSIX version, which is clearly documented in the API document. I would not use "worst" for this. I do

Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Tom Christiansen
Sherman wrote: > The Unicode/java version of lowercase, uppercase, whitespace > and letter character classes are provided via \p{javaXYZ}, I'm afraid that is *not* true; please see part 2. > and the \p{Lower/Upper/Alpha/Space} are specified/implemented > for POSIX version, which is clearly docum

Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Tom Christiansen
Sherman wrote: > the \p{Lower/Upper/Alpha/Space} are specified/implemented for POSIX > version, which is clearly documented in the API document. I don't see how you can use Unicode names and give them non-Unicode meanings. That doesn't seem fair. Perl had the same problem for a long time. We f
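The distinction under discussion in this thread can be observed directly. The following is a minimal sketch, not taken from any of the messages; the class name PosixVsJavaClasses and the helper matches are made up for illustration. It only shows that, with no flags set, \p{Lower} behaves as the ASCII-only POSIX class while \p{javaLowerCase} defers to Character.isLowerCase. It does not settle whether \p{javaLowerCase} is the same thing as the Unicode Lowercase property, which is the point Tom disputes.

```java
import java.util.regex.Pattern;

class PosixVsJavaClasses {
    // True if the whole string s matches the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // POSIX-style class: ASCII lowercase letters only (no flags set).
        System.out.println(matches("\\p{Lower}", "e"));            // true
        System.out.println(matches("\\p{Lower}", eAcute));         // false

        // java.lang.Character-based class: uses Character.isLowerCase.
        System.out.println(matches("\\p{javaLowerCase}", eAcute)); // true
    }
}
```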

RL1.2 Properties (part 2 of 2)

2011-01-23 Thread Tom Christiansen
This message explains precisely how Java fails to provide any way to access these four required properties from RL1.2: Alphabetic, Lowercase, Uppercase, Whitespace. Since Java does not provide them *by any name*, and RL1.2 specifically includes those four in its "To meet this requirem

RL1.2 Compatibility Properties

2011-01-23 Thread Tom Christiansen
RL1.2a Compatibility Properties To meet this requirement, an implementation shall provide the properties listed in Annex C. Compatibility Properties, with the property values as listed there. Such an implementation shall document whether it is using the Standard Recommenda

RL1.3 Subtraction and Intersection

2011-01-23 Thread Tom Christiansen
RL1.3 Subtraction and Intersection To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of Unicode sets. Java meets this requirement. However, because RL1.2 is not met, it is of limited practical usefulness. This is
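For readers unfamiliar with the syntax, here is a small sketch of the three set operations as java.util.regex spells them. The patterns are the ones used in the Pattern class documentation; the class name SetOperations and the helper matches are illustrative only and do not come from the message.

```java
import java.util.regex.Pattern;

class SetOperations {
    // True if the whole string s matches the given regex.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Union: a nested class adds its members to the outer class.
        System.out.println(matches("[a-d[m-p]]", "n"));    // true
        // Intersection: && keeps only characters present in both sets.
        System.out.println(matches("[a-z&&[def]]", "a"));  // false
        // Set difference: intersect with a negated class.
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false
    }
}
```

The same operators accept \p{...} properties as operands, which is why the usefulness of RL1.3 depends on which properties are available.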

RL1.4 Simple Word Boundaries

2011-01-23 Thread Tom Christiansen
Java does not meet this requirement. Specifically, it does not offer a mechanism for stipulation #1 cited below: RL1.4 Simple Word Boundaries To meet this requirement, an implementation shall extend the word boundary mechanism so that: (1) The class of <word_character> includes all the A

RL1.5 Simple Loose Matches

2011-01-23 Thread Tom Christiansen
Java meets this requirement: RL1.5 Simple Loose Matches To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching. To meet this requirement, if an implement
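As a hedged illustration of how this is exposed in java.util.regex: case-insensitive matching is ASCII-only by default, and Unicode-aware folding has to be requested with the UNICODE_CASE flag. The sketch below assumes nothing beyond the documented Pattern flags; the class name LooseMatch and the method matchesIgnoringCase are hypothetical.

```java
import java.util.regex.Pattern;

class LooseMatch {
    // Case-insensitive full match; unicodeCase selects Unicode-aware folding.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE | (unicodeCase ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String upper = "\u00C9"; // LATIN CAPITAL LETTER E WITH ACUTE

        System.out.println(matchesIgnoringCase("e", "E", false));     // true
        System.out.println(matchesIgnoringCase(lower, upper, false)); // false
        System.out.println(matchesIgnoringCase(lower, upper, true));  // true
    }
}
```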

RL1.6 Line Boundaries

2011-01-23 Thread Tom Christiansen
Java meets this requirement, but only just barely. RL1.6 Line Boundaries To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028). The reason I say
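A quick way to check which line terminators the engine recognizes is to count where ^ matches in MULTILINE mode. This is only a sketch for experimentation and does not reproduce the reasoning in the message, which is cut off above; the class name LineBoundaries and the method countLineStarts are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBoundaries {
    // Counts the positions where ^ matches in MULTILINE mode (one per line start).
    static int countLineStarts(String text, int extraFlags) {
        Matcher m = Pattern.compile("^", Pattern.MULTILINE | extraFlags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Seven lines separated by LF, CRLF, CR, NEL, LS and PS respectively.
        String text = "a\nb\r\nc\rd\u0085e\u2028f\u2029g";

        System.out.println(countLineStarts(text, 0));                  // 7
        // With UNIX_LINES only LF counts as a line terminator.
        System.out.println(countLineStarts(text, Pattern.UNIX_LINES)); // 3
    }
}
```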

RL1.7 Code Points

2011-01-23 Thread Tom Christiansen
I am somewhat uncertain, but I believe that Java *almost* meets this requirement. 1.7 Code Points A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units. RL1.7 Supplementary Code Points To meet this requirement, an

Summary of tr18 Level 1 compliance findings

2011-01-23 Thread Tom Christiansen
Here is a summary of my findings:

Compliance  Req Num  Description
???         RL1.1    Hex Notation
no          RL1.2    Properties
no          RL1.2a   Compatibility Properties
yes         RL1.3    Subtraction and Intersection
no

j.u.r.Pattern documentation errors

2011-01-23 Thread Tom Christiansen
In this message I cover only those errors made in the final section ("Comparison to Perl 5") of: http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html I really hope no one is offended by this. I don't mean to be a nitpicker. Technical errors in the documentation should b

Suggested corrections to the Level 1 conformance statement

2011-01-23 Thread Tom Christiansen
In the JDK7 Pattern documentation, it says: This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents. But the very first thing in tr18's conformance section reads: C0. An implementation cl

Re: RL1.7 Code Points

2011-01-23 Thread Masayoshi Okutsu
Are you talking about unpaired surrogates or something else? Thanks, Masayoshi On 1/24/2011 5:22 AM, Tom Christiansen wrote: I am somewhat uncertain, but I believe that Java *almost* meets this requirement. 1.7 Code Points A fundamental requirement is that Unicode text be interprete

Re: RL1.7 Code Points

2011-01-23 Thread Tom Christiansen
> Are you talking about unpaired surrogates or something else? Yes, I am talking about unpaired surrogates. --tom

Re: RL1.7 Code Points

2011-01-23 Thread Masayoshi Okutsu
I believe each code unit of UTF-16 gets converted to its code point. So, an unpaired surrogate gets converted to a surrogate code point. So, it's still processed based on code points? Masayoshi On 1/24/2011 2:22 PM, Tom Christiansen wrote: Are you talking about unpaired surrogates or somethin
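The behavior Masayoshi describes can be probed with a short experiment. This is a sketch under the assumption that a single "." is a fair stand-in for "one code point"; the class name CodePointMatching and its constants are illustrative and not part of the thread.

```java
import java.util.regex.Pattern;

class CodePointMatching {
    // U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A: one code point, two UTF-16 code units.
    static final String SUPPLEMENTARY = "\uD835\uDD38";
    // The high surrogate on its own, i.e. an unpaired surrogate.
    static final String LONE_HIGH = "\uD835";

    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(SUPPLEMENTARY.length());          // 2 code units
        System.out.println(matches(".", SUPPLEMENTARY));     // true: one code point
        System.out.println(matches("..", SUPPLEMENTARY));    // false: not two
        // An unpaired surrogate is handed to the matcher as a surrogate code point.
        System.out.println(LONE_HIGH.codePointAt(0) == 0xD835); // true
        System.out.println(matches(".", LONE_HIGH));         // true
    }
}
```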

Re: RL1.4 Simple Word Boundaries

2011-01-23 Thread Xueming Shen
Tom, Thanks for the detailed and excellent "reality check". While I'm still going through all the details it appears that the fact the current Java Unicode property data does not include the properties defined in PropList.txt (current implementation reads the property data only from UnicodeDat

Re: j.u.r.Pattern documentation errors

2011-01-23 Thread Xueming Shen
Thanks Tom. That part of the doc definitely needs a revisit; it was written before 2002 (probably against Perl 5.6) and has not been touched since, and much of it is no longer true given the latest 5.12. -Sherman On 1-23-2011 14:14 02:14 PM, Tom Christiansen wrote: In this message I cover only those e