Sherman wrote:

> Oh, I see the problem. Obviously I have been working on jdk7 too long
> and forgot the latest release is still 6:-( There is indeed a bug in
> the previous implementation which I fixed in 7 long time ago (I
> mentioned this in one of the early emails but was not specific, my
> apology), probably should backport to 6 update release asap. The test
> case runs well (the "failures" in literals are expected)

Could you please elaborate a bit on that? Code points specified by
value are not to be re-evaluated for pattern-syntax senses
("meta-ness"). Could you please show one sample string and one sample
regex containing a "\\uXXXX" mention that you expect to fail? There
should be no failures at all when doing that.

I tried setting up some smaller tests, but I encountered bugs in the
regex compiler, so I don't trust anything. The bug I encountered was
matching the string "*" against the pattern "^*$". Java fails to
detect that this is an invalid regex. You cannot quantify a zero-width
assertion; it should have raised an exception. Apparently the compiler
is tricked into thinking that there is a literal "*" there. That's why
I don't trust my correctness tests on literalness.

> on 7 with the following output. I modified your test case "slightly"
> since it appears the UnicodeSet class in our normalizer package does
> not have the size(), replace it with a normal hashset.

Does that mean the following now works?

    1. a+b matches "[" + a + b + "]+"
    2. b+a matches "[" + a + b + "]+"
    3. a+b matches "[" + b + a + "]+"
    4. b+a matches "[" + b + a + "]+"

That is, when a and b take on every Unicode code point, meaning from
U+0000 up to U+10FFFF? If they do not, then one is not specifying
Unicode code points.
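
The four checks above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not anyone's actual test case: it spells each code point with the \x{h...h} escape, which Pattern accepts from Java 7 on, and the names ClassOrderCheck, esc, str and allFourMatch are hypothetical. It deliberately leaves out surrogate code points, since how those should be treated is the open question here.

```java
import java.util.regex.Pattern;

public class ClassOrderCheck {
    // Spell a code point as a regex escape, e.g. 0x1033C -> \x{1033C}.
    // Assumption: the \x{...} form is available (Java 7 or later).
    static String esc(int cp) {
        return "\\x{" + Integer.toHexString(cp).toUpperCase() + "}";
    }

    // Turn a code point into a String (one or two UTF-16 chars).
    static String str(int cp) {
        return new String(Character.toChars(cp));
    }

    // True only if a+b and b+a both match "[ab]+" and "[ba]+".
    static boolean allFourMatch(int a, int b) {
        Pattern ab = Pattern.compile("[" + esc(a) + esc(b) + "]+");
        Pattern ba = Pattern.compile("[" + esc(b) + esc(a) + "]+");
        String s1 = str(a) + str(b);
        String s2 = str(b) + str(a);
        return ab.matcher(s1).matches() && ab.matcher(s2).matches()
            && ba.matcher(s1).matches() && ba.matcher(s2).matches();
    }

    public static void main(String[] args) {
        int[][] pairs = {
            {0x1033C, 0x61},  // supplementary code point plus an ASCII letter
            {0x5B, 0x5D},     // '[' and ']' given by value, not as syntax
            {0x5E, 0x2D},     // '^' and '-' given by value, not as syntax
        };
        for (int[] p : pairs) {
            System.out.printf("U+%04X U+%04X -> %b%n",
                              p[0], p[1], allFourMatch(p[0], p[1]));
        }
    }
}
```

Looping a and b over the whole range instead of a few hand-picked pairs would be the real test; the pairs here just pick out the cases where a value might be mistaken for pattern syntax or for UTF-16 code units.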

Please correct me if I am wrong, but I believe the following code
showing how logical code points are *never* mistaken for their
serialization representations is conforming behaviour--and that
results other than these would indicate nonconforming behaviour:

    $ perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
    1
    $ perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
    0
    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
    0
    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    1
    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
    1
    $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    1
    $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
    1

Can Java do that yet? If not, then \uXXXX does not meet RL1.1, and
one appears to need \x{} or its equivalent to do so--with the proviso
from the top of this message that it must not be double evaluated for
meta characters: \x{} must always be a literal code point of that
number without regard to reinterpretation as UTF-16 or as pattern
syntax.

I'm sorry if this is too terse. I do not mean to be in the least bit
confrontational! I apologize in advance if it sounds that way; I
really do not intend it. It is possible that I have a different way of
looking at regexes than Java folks have historically considered them.
Even if so, I believe my way of looking at them accords with tr18's
RL1.1 in both its letter and its spirit, and that Java's current way
fails to meet that requirement in either sense.

--tom

#!/bin/sh
# expected results: 1 0 0 0 0 0 0 1 1 1 1
perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
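
For comparison, a rough Java counterpart of the first two lines of the script above might look like the sketch below. It assumes Java 7 or later for the \x{h...h} escape, and the names CodePointVsBytes and matchesByValue are hypothetical. It covers only the code-point-versus-UTF-8 distinction and does not attempt the lone-surrogate cases, where Java's behaviour is the thing being asked about.

```java
import java.util.regex.Pattern;

public class CodePointVsBytes {
    // Does the whole input consist of the single code point U+1033C,
    // with the code point named by value in the pattern?
    static boolean matchesByValue(String input) {
        return Pattern.compile("^\\x{1033C}$").matcher(input).matches();
    }

    public static void main(String[] args) {
        // The logical code point U+1033C (stored as two UTF-16 chars).
        String codePoint = new String(Character.toChars(0x1033C));

        // Four separate code points whose values happen to equal the
        // UTF-8 bytes of U+1033C; this is not the same string.
        String utf8AsChars = new String(new char[] {0xF0, 0x90, 0x8C, 0xBC});

        System.out.println(codePoint.codePointCount(0, codePoint.length()));
        System.out.println(matchesByValue(codePoint));
        System.out.println(matchesByValue(utf8AsChars));
    }
}
```

The second string is built from a char array rather than from byte decoding so that no charset conversion is involved; the point is only that a pattern naming U+1033C by value should match the first string and not the second.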