date:20110423

Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen


 Hi

This proposal tries to address

(1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1]
requirement, as Tom pointed out in his email on the i18n-dev list [2].
Basically we have 3 problems here.

a. j.u.regex's word boundary constructs \b and \B use Unicode
\p{letter} + \p{digit} as the "word" definition, when the standard
requires that the true Unicode \p{Alphabetic} property be used instead.
It also neglects two of the specifically required characters:
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
(or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
\p{gc=Connector_Punctuation}, if we follow Annex C).
b. j.u.regex's word constructs \w and \W are the ASCII-only version.
c. It breaks the historical connection between word characters and
word boundaries (because of a) and b)). For example, "élève" is NOT
matched by the \b\w+\b pattern.


(2) j.u.regex does not meet Unicode regex's Properties requirement
[3][5][6][7]. The main issues are

a. Alphabetic: totally missing from the platform, not only from regex.
b. Lowercase, Uppercase and White_Space: the Java implementation (via
\p{javaMethod}) differs from the Unicode Standard definition.
c. j.u.regex's POSIX character classes are ASCII only, while the
standard has a Unicode version defined in tr#18 Annex C [3].
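
The ASCII-only defaults listed above can be reproduced with a short sketch. It assumes a stock JDK and uses only default (flag-free) patterns; the class and method names are hypothetical and chosen for illustration. It uses matches() rather than find() so that only whole-string matches count.

```java
import java.util.regex.Pattern;

final class AsciiDefaultsDemo {
    // "élève", written with escapes so the source file encoding does not matter.
    static final String ELEVE = "\u00e9l\u00e8ve";

    /** True if the whole input matches the regex compiled without any flags. */
    static boolean matchesDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Problem (1): the default \w is ASCII-only, so the accented word fails.
        System.out.println(matchesDefault("\\b\\w+\\b", ELEVE));    // false
        System.out.println(matchesDefault("\\b\\w+\\b", "eleve"));  // true

        // Problem (2)c: the default POSIX classes and \d are ASCII-only too.
        System.out.println(matchesDefault("\\p{Lower}", "\u00e9")); // false (e with acute)
        System.out.println(matchesDefault("\\d", "\u0663"));        // false (ARABIC-INDIC DIGIT THREE)
    }
}
```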

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to
a) flip the ASCII-only predefined character classes (\b \B \w \W \d
\D \s \S) and POSIX character classes (\p{alpha}, \p{lower},
\p{upper}...) to their Unicode versions
b) enable UNICODE_CASE (anything Unicode)

Ideally we would just evolve/upgrade Java regex from the aged
"ascii-only" to Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a
fallback :-)), like what Perl did. But given Java's "compatibility"
spirit (and the performance concern as well), this is unlikely to
happen.

(2) add \p{IsBinaryProperty} to explicitly support some important
Unicode binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic},
\p{IsPunctuation}... With this, j.u.regex can easily access some
properties that either are not provided by j.l.Character directly or
for which j.l.Character has a different version (for example
White_Space).
(The missing Alphabetic and the differing uppercase/lowercase issues
have been/are being addressed in CR#7037261 [4]; any reviewer?)
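
A sketch of how the two proposed changes look from user code. One assumption is made loudly here: the flag is referred to by the name under which it is available in released JDKs (7 and later), Pattern.UNICODE_CHARACTER_CLASS, rather than by any of the names floated in this thread. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    /** Whole-input match with the Unicode character-class flag switched on. */
    static boolean matchesUnicode(String regex, String input) {
        // Assumption: the flag name as released in JDK 7+, not a name from this thread.
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input).matches();
    }

    /** Whole-input match with no flags at all. */
    static boolean matchesDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (1) The flag flips \b, \w, \d and the POSIX classes to their Unicode versions.
        System.out.println(matchesUnicode("\\b\\w+\\b", "\u00e9l\u00e8ve")); // true
        System.out.println(matchesUnicode("\\p{Lower}", "\u00e9"));          // true
        System.out.println(matchesUnicode("\\d", "\u0663"));                 // true

        // (2) Binary properties need no flag. U+2160 ROMAN NUMERAL ONE is
        // Alphabetic but is not a Letter (its general category is Nl).
        System.out.println(matchesDefault("\\p{L}", "\u2160"));              // false
        System.out.println(matchesDefault("\\p{IsAlphabetic}", "\u2160"));   // true

        // U+00A0 NO-BREAK SPACE has the Unicode White_Space property, but
        // Character.isWhitespace deliberately excludes it.
        System.out.println(matchesDefault("\\p{javaWhitespace}", "\u00a0")); // false
        System.out.println(matchesDefault("\\p{IsWhite_Space}", "\u00a0"));  // true
    }
}
```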

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is
approved. This is coming in very late, so it may be held back until
Java 8 if it cannot make the cutoff for jdk7.


-Sherman


[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen


 The flag this request proposed to add is

 UNICODE_CHARSET

not the "UNICODE_UNICODE" in my last email.

My apologies for the typo.

Any suggestion for a better name? It was UNICODE_CHARACTERCLASS, but then it
became UNICODE_CHARSET, considering the unicode_case.

-Sherman


Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Mark Davis ☕

The changes sound good. The flag UNICODE_CHARSET will be misleading, since
all of Java uses the Unicode Charset (= encoding). How about:

UNICODE_SPEC

or something that gives that flavor.

Mark

*— Il meglio è l’inimico del bene —*


On Sat, Apr 23, 2011 at 01:12, Xueming Shen  wrote:

>  The flag this request proposed to add is
>
>  UNICODE_CHARSET
>
> not the "UNICODE_UNICODE" in last email.
>
> My apology for the typo.
>
> Any suggestion for a better name? It was UNICODE_CHARACTERCLASS, but then
> it
> became UNICODE_CHARSET, considering the unicode_case.
>
> -Sherman

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Tom Christiansen

Mark Davis ☕  wrote
   on Sat, 23 Apr 2011 09:09:55 PDT: 

> The changes sound good. 

They sure do, don't they?  I'm quite happy about this.  I think it is more
important to get this in the queue than that it (necessarily) be done for
JDK7.  That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
makes it attractive now.  But if not now, then soon is good enough.

> The flag UNICODE_CHARSET will be misleading, since
> all of Java uses the Unicode Charset (= encoding). How about:

>   UNICODE_SPEC

> or something that gives that flavor.

I hadn't thought of that, but I do see what you mean.  The idea is 
that the semantics of \w etc change to match the Unicode spec in tr18.

I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
broad a brush.  What then happens when, as I imagine it someday shall,
Java gets full support for RL2.3 boundaries, the way one uses (?w)
or UREGEX_UWORD for with ICU?

Wouldn't calling something UNICODE_SPEC be too broad? Or should
UNICODE_SPEC automatically include not just existing Unicode flags
like UNICODE_CASE, but also any UREGEX_UWORD that comes along?
If it does, you have a back-compat issue, and if it doesn't, you
have a misnaming issue.  Seems like a bit of a Catch-22.

The reason I'd suggested UNICODE_CHARSET was because of my own background
with the names we use for this within the Perl regex source code (which is
itself written in C).  I believe that Java doesn't have the same situation
as gave rise to it in Perl, and perhaps something else would be clearer.

Here's some background for why we felt we had to go that way. To control
the behavior of \w and such, when a regex is compiled, a compiled Perl
regex gets exactly one of these states:

REGEX_UNICODE_CHARSET
REGEX_LOCALE_CHARSET
REGEX_ASCII_RESTRICTED_CHARSET
REGEX_DEPENDS_CHARSET 

That state it normally inherits from the surrounding lexical scope,
although this can be overridden with /u and /a, or (?u) and (?a),
either within the pattern or as a separate pattern-compilation flag.

REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
full RL1.2a definitions.  Because Perl always does Unicode casemapping --
and full casemapping, too, not just simple -- we didn't need (?u) for what
Java uses it for, which is just as an extra flavor of (?i); it doesn't
do all that much.
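
The limited role of Java's (?u) described above can be seen in a small sketch, assuming a stock JDK. The class and method names are hypothetical. (?u) only widens case-insensitive matching beyond ASCII; on its own it leaves \w untouched.

```java
import java.util.regex.Pattern;

final class UnicodeCaseDemo {
    /** Whole-input match; any flags are embedded in the regex itself. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String upperEAcute = "\u00c9"; // LATIN CAPITAL LETTER E WITH ACUTE

        // (?i) alone folds case for ASCII letters only.
        System.out.println(matches("(?i)\u00e9", upperEAcute));  // false

        // (?iu) adds Unicode-aware case folding on top of (?i).
        System.out.println(matches("(?iu)\u00e9", upperEAcute)); // true

        // (?u) does not change what \w means: it stays ASCII-only.
        System.out.println(matches("(?u)\\w", "\u00e9"));        // false
    }
}
```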

(BTW, the old default is *not* some sort of non-Unicode charset
semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
code points > 255 and "maybe" so in the 128-255 range.)

What we did certainly isn't perfect, but it allows for both backwards
compat and future growth.  This was because people want(ed) to be able to
use regexes on both byte arrays yet also on character strings.  Me, I think
it's nuts to support this at all, that if you want an input stream in (say)
CP1251 or ISO 8859-2, that you simply set that stream's encoding and be
done with it: everything turns into characters internally.  But there's old
byte and locale code out there whose semantics we are loth to change out
from under people.  Java has the same kind of issue.

The reason we ever support anything else is because we got (IMHO nasty)
POSIX locales before we got Unicode support, which didn't happen till
toward the end of the last millennium.  So we're stuck supporting code
well more than a decade old, perhaps indefinitely.  It's messy, but it
is very hard to do anything about that.  I think Java shares in that
perspective.

This corresponds, I think, to Java needing to support pre-Unicode
regex semantics on \w and related escapes.  If they had started out
with it always meaning the real thing, the way ICU did, they wouldn't
need both.

I wish there were a pragma to control this on a per-lexical-scope basis,
but I don't know enough about the Java compiler's internals to begin to
know how to go about implementing something like that, even as a
-XX:+UseUnicodeSemantics CLI switch for that compilation unit.

One reason you want this is because the Java String class has these
"convenience" methods like matches, replaceAll, etc, that take regexes
but do not provide an API that admits Pattern compile flags.  There
is no way to embed a (?U) directive or some such, nor any way to pass
in a Pattern.UNICODE_something flag.  The Java String API could also
be broadened through method signature overloading, but for now, you
can't do that.

No matter what the UNICODE_something gets called, I think there needs to be
a corresponding embeddable (?X)-style flag as well.  Even if String were
broadened, you'd want people to be able to specify *within the regex* that
that regex should have full Unicode semantics.  After all, they might read
the pattern in from a file.  That's why (most) Pattern.compile flags need
to be able to be embedded, too.  But you knew that already. :)

--tom

Suggested Perl-related updates for Pattern doc

2011-04-23 Thread Tom Christiansen

Sherman, 

The comparison to Perl 5 in the Java Pattern class documentation needs
to be corrected.  However, I would not recommend as long a laundry list
of missing features from either side as the following email might imply.
I'm just trying to be complete, but in doing so, it produces a list that
I think is too unruly for inclusion.  Part of that, however, may be
because I have included a lot of auxiliary information and examples to
show you what I mean.  Those of course don't need to go in the javadoc.

My minimal suggested change would be to bring it into alignment with the
current production release of Perl instead of one from the
previous millennium -- and in some cases, from much older still.
Whether you choose 5.12 or 5.14, you should clearly state *which*
version of Perl you're comparing yourself with: it is the lack
of a reference version number that caused this to become so false.

Sherman, you do a much better job than I do of patching javadoc in a
way consistent in tone and texture, so I am comfortable leaving this
to your discretion.

I hope this helps.  If there's anything more I can do to help,
please do not hesitate to ask.  Thank you for all your work; 
I am quite enthusiastic about all of this.

--tom

> Comparison to Perl 5 

This was applicable to 2000's Perl 5.6 release, and also to a
much older version of the Java Pattern class.  Both have advanced
beyond what the comparison claims.

> The Pattern engine performs traditional NFA-based matching with
> ordered alternation as occurs in Perl 5.

Although I agree that Perl and Java use the same sort of matcher, 
I'm not sure it is accurate to call it a traditional NFA matcher.  
Both are recursive backtracking matchers, necessitated by the 
backref support.  The difference between these two algorithms 
is well explained in Russ Cox's paper on

"Regular Expression Matching Can Be Simple And Fast 
 (but is slow in Java, Perl, PHP, Python, Ruby, ...)"

http://swtch.com/~rsc/regexp/regexp1.html

The Cox paper shows how pathological patterns cause a recursive
backtracking algorithm to degrade exponentially with respect to
input length, and how that does not occur under a traditional
NFA.  It is easy to demonstrate this issue from the command line:

$ time perl -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
2.803u 0.000s 0:02.80
$ time perl -le 'print(("a" x 20) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
4.077u 0.002s 0:04.08
$ time perl -le 'print(("a" x 21) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
6.039u 0.003s 0:06.04
$ time perl -le 'print(("a" x 22) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
8.756u 0.000s 0:08.76

In contrast, if you swap in Cox's RE2 library (this is a CPAN module) in
place of Perl's default regex engine, that all disappears:

$ time perl -Mre::engine::RE2 -le 'print(("a" x 19)   =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.003s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 50)   =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.002u 0.000s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 500)  =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.002s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 5000) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.000s 0:00.00

That's because Cox is using a traditional NFA, but Perl (by default) 
and Java (always) are both using a recursive backtracker variant
of the same.  Read Cox; he explains it more clearly than I have.
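
The same experiment can be sketched on the Java side with the identical pattern. This is only a rough timing harness with hypothetical class and method names; the timings depend on the JDK version and hardware, and the input sizes are kept small on purpose so the run finishes quickly. No particular numbers are claimed here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class BacktrackDemo {
    // The same pathological pattern as in the Perl one-liners above.
    static final Pattern PATHOLOGICAL =
            Pattern.compile("a*a*a*a*a*a*a*a*a*a*[Bb]");

    /** Builds a string of n 'a' characters, like Perl's ("a" x n). */
    static String repeatA(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    /** Searches n 'a's for the pattern; it can never match, since there is no B or b. */
    static boolean findsMatch(int n) {
        return PATHOLOGICAL.matcher(repeatA(n)).find();
    }

    public static void main(String[] args) {
        // Watch how the elapsed time grows as n increases.
        for (int n = 8; n <= 16; n += 2) {
            long start = System.nanoTime();
            boolean found = findsMatch(n);
            long micros = (System.nanoTime() - start) / 1000L;
            System.out.println("n=" + n + " found=" + found + " time=" + micros + "us");
        }
    }
}
```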

> Perl constructs not supported by this class:
>  The conditional constructs (?{X}) and (?(condition)X|Y),
>  The embedded code constructs (?{code}) and (??{code}),
>  The embedded comment syntax (?#comment), and
>  The preprocessing operations \l, \u, \L, and \U.

Well, yes, but those are string-interpolation things: they 
don't happen in the regex compiler; likewise \Q.  If you
pass a string with \Q or \U in it to the regex compiler
but not through the double-quote interpolation, such as 
if you read it from a file, then those do not happen.

Here are other things that are missing.  Perl release
numbers follow the convention that odd numbers are 
developer releases and even numbers are production releases.
I shall therefore only mention even-numbered releases.

 == Since the Perl 5.6 release of 2000, Perl also supports
these constructs not supported by the Java Pattern class:

  *  Unicode grapheme clusters via \X.
  *  Unicode named characters (the Name property) using
     the \N{NAME} escape via the charnames pragma.
     This includes those from NameAliases.txt.
  *  ALL Unicode properties supported by whatever version
     of the UCD is current at the time of release, not just
     those from UnicodeData.txt; see
     http://unicode.org/reports/tr44/#Property_Index for
     the current list, or the perluniprops manpage

Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen

Forwarding...forgot to include the list.

 Original Message 
Subject: 	Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 
RL1.4 Simple Word Boundaries and RL1.2 Properties

Date:   Sat, 23 Apr 2011 17:53:42 -0700
From:   Xueming Shen 
To: Tom Christiansen 

 Mark, Tom,

I agree with Mark that UNICODE_SPEC is a better name than
UNICODE_CHARSET. We will have to deal with the "compatibility" issue
Tom mentioned anyway, should Java move to a higher level of Unicode
regex support someday. A new option/flag will have to be introduced to
give the developer the choice, just like what we are trying to do with
the ASCII-only or Unicode versions of those classes.

I also agree we should have an embedded flag. I was thinking we could
add it later, for example in JDK8, if we can get this one into jdk7,
but the Pattern usage in the String class is persuasive.

The webrev, specdiff and Pattern doc have been updated to use
UNICODE_SPEC as the flag and (?U) as the embedded flag. It might be a
little confusing, given that we use (?u) for UNICODE_CASE, but it
might feel "natural" to have an uppercase "U" for broader Unicode
support.
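
A sketch of why the embedded flag matters for the String convenience methods, which take no flags argument. It assumes a JDK in which the (?U) embedded flag is available (7 or later); the class and method names are hypothetical.

```java
final class EmbeddedFlagDemo {
    /** Replaces every run of word characters with "X" via String.replaceAll. */
    static String maskWords(String wordRegex, String input) {
        return input.replaceAll(wordRegex, "X");
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve";       // "élève"
        String phrase = "un \u00e9l\u00e8ve";   // "un élève"

        // String.matches cannot take flags, so the default ASCII \w applies.
        System.out.println(eleve.matches("\\w+"));      // false

        // With the embedded flag, the same call gets Unicode semantics.
        System.out.println(eleve.matches("(?U)\\w+"));  // true

        // Without (?U) the accented letters split the word into pieces.
        System.out.println(maskWords("\\w+", phrase));      // X \u00e9X\u00e8X
        System.out.println(maskWords("(?U)\\w+", phrase));  // X X
    }
}
```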

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

 j.u.regex.Pattern API:
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff:
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

Tom, it would be appreciated if you could at least give the doc update
a quick scan to see if I missed anything. And thanks for the
suggestions for the Perl-related doc update; I will need to go through
them a little later and address them in a separate CR.

Thanks,
-Sherman

