Tom,
The Unicode/Java versions of the lowercase, uppercase, whitespace and letter
character classes are provided via \p{javaXYZ}, and the
\p{Lower/Upper/Alpha/Space} classes are specified/implemented for the POSIX
version, which is clearly documented in the API document. I would not use
"worst" for this. I do
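The distinction drawn above can be checked directly. What follows is a minimal sketch, not part of the original message; the class and method names are hypothetical, and it assumes a Pattern compiled with no flags, so that \p{Lower} takes its documented POSIX (US-ASCII) meaning.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PosixVsJavaClasses {
    // \p{Lower} with no flags is the POSIX class, documented as [a-z].
    static boolean posixLower(String s) {
        return Pattern.matches("\\p{Lower}", s);
    }

    // \p{javaLowerCase} delegates to java.lang.Character.isLowerCase().
    static boolean javaLower(String s) {
        return Pattern.matches("\\p{javaLowerCase}", s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(posixLower("a"));    // true
        System.out.println(posixLower(eAcute)); // false: outside [a-z]
        System.out.println(javaLower(eAcute));  // true
    }
}
```

Whether Character.isLowerCase() agrees with the Unicode Lowercase property for every code point is exactly the point disputed in this thread, so the sketch shows only that the two names behave differently.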
Sherman wrote:
> The Unicode/java version of lowercase, uppercase, whitespace
> and letter character classes are provided via \p{javaXYZ},
I'm afraid that is *not* true; please see part 2.
> and the \p{Lower/Upper/Alpha/Space} are specified/implemented
> for POSIX version, which is clearly docum
Sherman wrote:
> the \p{Lower/Upper/Alpha/Space} are specified/implemented for POSIX
> version, which is clearly documented in the API document.
I don't see how you can use Unicode names and give them non-Unicode
meanings. That doesn't seem fair.
Perl had the same problem for a long time. We f
This message explains precisely how Java fails to provide
any way to access these four required properties from RL1.2:
Alphabetic
Lowercase
Uppercase
Whitespace
Since Java does not provide them *by any name*, and RL1.2 specifically
includes those four in its "To meet this requirem
RL1.2a Compatibility Properties
To meet this requirement, an implementation shall provide the
properties listed in Annex C. Compatibility Properties, with the
property values as listed there. Such an implementation shall
document whether it is using the Standard Recommenda
RL1.3 Subtraction and Intersection
To meet this requirement, an implementation shall supply
mechanisms for union, intersection and set-difference of
Unicode sets.
Java meets this requirement. However, because RL1.2 is not met,
it is of limited practical usefulness.
This is
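For reference, Java's set operations look like this. This is a minimal sketch with hypothetical names, added for illustration. It uses general categories (\p{L}, \p{Lu}, \p{Nd}) rather than the RL1.2 binary properties, which the message above says are the ones that are missing.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SetOperations {
    // Union: listing classes side by side inside one bracket expression.
    static final Pattern UNION = Pattern.compile("[\\p{Lu}\\p{Nd}]");
    // Intersection: the && operator.
    static final Pattern INTERSECTION = Pattern.compile("[\\p{L}&&[\\p{Lu}]]");
    // Set difference: intersection with a negated class.
    static final Pattern DIFFERENCE = Pattern.compile("[\\p{L}&&[^\\p{Lu}]]");

    static boolean in(Pattern set, String s) {
        return set.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(in(UNION, "7"));        // true: a decimal digit
        System.out.println(in(INTERSECTION, "a")); // false: a letter, but not uppercase
        System.out.println(in(DIFFERENCE, "a"));   // true: a letter that is not uppercase
    }
}
```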
Java does not meet this requirement. Specifically, it
does not offer a mechanism for stipulation #1 cited below:
RL1.4 Simple Word Boundaries
To meet this requirement, an implementation shall extend the
word boundary mechanism so that:
(1) The class of <word_character> includes all the A
Java meets this requirement:
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for
case-insensitive matching, then it shall provide at least the
simple, default Unicode case-insensitive matching.
To meet this requirement, if an implement
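A minimal sketch of how case-insensitive matching behaves in java.util.regex, added for illustration with hypothetical names. It assumes the documented behaviour that CASE_INSENSITIVE alone covers only US-ASCII and that UNICODE_CASE must be added for Unicode-aware folding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LooseMatch {
    // CASE_INSENSITIVE alone: only US-ASCII letters are folded.
    static boolean asciiOnly(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // CASE_INSENSITIVE plus UNICODE_CASE: Unicode-aware case folding.
    static boolean unicodeAware(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute, lowercase
        String upper = "\u00c9"; // E with acute, uppercase
        System.out.println(asciiOnly("abc", "ABC"));    // true
        System.out.println(asciiOnly(lower, upper));    // false
        System.out.println(unicodeAware(lower, upper)); // true
    }
}
```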
Java meets this requirement, but only just barely.
RL1.6 Line Boundaries
To meet this requirement, if an implementation provides for
line-boundary testing, it shall recognize not only CRLF, LF,
CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).
The reason I say
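The line terminators that RL1.6 lists can be exercised as follows. This is a minimal sketch with hypothetical names, added for illustration. It covers only the ^ and $ anchors in MULTILINE mode and does not try to reproduce the reservation voiced above, whose explanation is cut off in this archive.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LineBoundaries {
    // True if `line` occurs as a complete line of `text`, using ^ and $
    // in MULTILINE mode plus any extra flags the caller passes.
    static boolean hasLine(String text, String line, int extraFlags) {
        Pattern p = Pattern.compile("^" + Pattern.quote(line) + "$",
                                    Pattern.MULTILINE | extraFlags);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "a\u0085b\u2028c"; // lines separated by NEL and LS
        System.out.println(hasLine(text, "b", 0));                  // true
        // With UNIX_LINES only the newline character counts as a terminator.
        System.out.println(hasLine(text, "b", Pattern.UNIX_LINES)); // false
    }
}
```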
I am somewhat uncertain, but I believe that Java
*almost* meets this requirement.
1.7 Code Points
A fundamental requirement is that Unicode text be interpreted
semantically by code point, not code units.
RL1.7 Supplementary Code Points
To meet this requirement, an
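A minimal sketch of what "by code point, not code units" means for a well-formed supplementary character. It is added for illustration; the class name and the choice of U+1D504 as the sample character are assumptions of this example.

```java
// Hypothetical helper class, for illustration only.
class CodePointMatching {
    // U+1D504 MATHEMATICAL FRAKTUR CAPITAL A: one code point,
    // stored as two UTF-16 code units (a surrogate pair).
    static final String FRAKTUR_A = "\uD835\uDD04";

    static boolean matchesOneDot(String s) {
        return s.matches(".");
    }

    static boolean matchesTwoDots(String s) {
        return s.matches("..");
    }

    public static void main(String[] args) {
        System.out.println(FRAKTUR_A.length());                 // 2 code units
        System.out.println(FRAKTUR_A.codePointCount(0, 2));     // 1 code point
        System.out.println(matchesOneDot(FRAKTUR_A));           // true
        System.out.println(matchesTwoDots(FRAKTUR_A));          // false
    }
}
```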
Here is a summary of my findings:
Compliance  Req Num  Description
???         RL1.1    Hex Notation
no          RL1.2    Properties
no          RL1.2a   Compatibility Properties
yes         RL1.3    Subtraction and Intersection
no
In this message I cover only those errors made in the final
section ("Comparison to Perl 5") of:
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
I really hope no one is offended by this. I don't mean to be
a nitpicker. Technical errors in the documentation should b
In the JDK7 Pattern documentation, it says:
This class is in conformance with Level 1 of Unicode
Technical Standard #18: Unicode Regular Expression
Guidelines, plus RL2.1 Canonical Equivalents.
But the very first thing in tr18's conformance section reads:
C0. An implementation cl
Are you talking about unpaired surrogates or something else?
Thanks,
Masayoshi
On 1/24/2011 5:22 AM, Tom Christiansen wrote:
> I am somewhat uncertain, but I believe that Java
> *almost* meets this requirement.
> 1.7 Code Points
> A fundamental requirement is that Unicode text be interprete
> Are you talking about unpaired surrogates or something else?
Yes, I am talking about unpaired surrogates.
--tom
I believe each code unit of UTF-16 gets converted to its code point. So,
an unpaired surrogate gets converted to a surrogate code point. So, it's
still processed based on code points?
Masayoshi
On 1/24/2011 2:22 PM, Tom Christiansen wrote:
> Are you talking about unpaired surrogates or somethin
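The behaviour described in this reply can be observed with a lone high surrogate. This is a minimal sketch with hypothetical names, added for illustration; it shows only that an unpaired surrogate is seen as a code point in general category Cs, and does not settle whether that satisfies RL1.7.

```java
// Hypothetical helper class, for illustration only.
class UnpairedSurrogate {
    // A high surrogate with no following low surrogate.
    static final String LONE_HIGH = "\uD835";
    // The same high surrogate followed by a low surrogate (U+1D504).
    static final String PAIRED = "\uD835\uDD04";

    // True if the whole string is a single code point in category Cs.
    static boolean isSurrogateCodePoint(String s) {
        return s.matches("\\p{Cs}");
    }

    public static void main(String[] args) {
        // codePointAt returns the surrogate itself when it is unpaired.
        System.out.println(Integer.toHexString(LONE_HIGH.codePointAt(0))); // d835
        System.out.println(isSurrogateCodePoint(LONE_HIGH)); // true
        System.out.println(isSurrogateCodePoint(PAIRED));    // false
    }
}
```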
Tom,
Thanks for the detailed and excellent "reality check". While I'm still
going through all the details, it appears that the fact that the current
Java Unicode property data does not include the properties defined in
PropList.txt (the current implementation reads the property data
only from UnicodeDat
Thanks Tom.
That part of the doc definitely needs a revisit. It was written before 2002
(probably against Perl 5.6) and has not been touched since; much of it is
no longer true given the latest 5.12.
-Sherman
On 1-23-2011 14:14 02:14 PM, Tom Christiansen wrote:
> In this message I cover only those e