Sherman,

In part 1, I outlined my thinking on why making end-users think about representation issues in regexes goes against, if not perhaps the letter, then certainly the spirit of UTS #18 when it says that in a compliant implementation "the regular expression engine provides support for Unicode characters as basic logical units."
Please understand that I don't think that first issue is much of a big deal -- it's a rather low-priority bug at worst -- because looked at from a particular perspective, it is a surface-level matter only. (Also because it is easily addressed just by adding \x{XXX}, which is both simple and safe.) It's not that big of a deal because, as you yourself point out, Sherman, you can still specify any code point, although you have to bend over sideways to do it. And it doesn't affect behaviour at all, which is by far the more important matter.

These next two serialization concerns, however, are different. This time they are not just surface issues. They are actual behavioural problems in regexes that derive from the internal implementation of characters in Java:

    ** Surrogate Bugs in Regexes
    ** CANON_EQ Bugs in Regexes with \\uXXXX

I don't think users should have to know about those implementation details, but if they don't, they will get several sorts of anomalous behaviour. I therefore believe those two are both genuine bugs. I know exactly what is causing the second one (code included below), but fixing it is going to require some code rearrangement and reworking.

===========================
 Surrogate Bugs in Regexes
===========================

Here is one of them:

    Code Point   UTF-16 String       Pattern    Result
    ==========   ===============     =========  ======
    U+1F47E      "\uD83D\uDC7E"      /^.$/      true
    n/a          "\uD83D"            /^.$/      TRUE!

I do not understand how that same pattern -- which says to match strings containing a single Unicode code point only -- can test true on both those strings. That's why I believe the TRUE! result is an error. Don't you?

I understand that it brings up some tricky cases. Consider: if you have a string "HL" where H is a high surrogate and L a low surrogate, Java's regex engine correctly concludes that the string "HL" exactly matches the pattern "^.$" in its entirety; it has just one logical character in it. This is correct. It fails to match "^..$", which is also correct, and for the same reason.

However, if you flip those around to get the string "LH", it now exactly matches the pattern "^..$" in its entirety, thus claiming it holds exactly two characters even though there are no legal code points in it at all! And if you take just one of the two surrogates, either "H" or "L" alone, each of those will also match "^.$", just as "HL" does. That says a lone surrogate is just as much a single logical character as a proper pair of them together is. But that makes no sense at all. How can both be correct? Surely that *must* be a bug? What am I not understanding here?

I really think that rather than returning true for something that isn't even a legal Unicode code point, it should instead either

    1: raise an exception, and/or
    2: admit some pattern flag to deal with such cases

I say this because you are not supposed to have to deal with representation and serialization issues in regexes, yet this makes you think about them. It also gives you bizarre answers even when you do think about them.
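In case it helps, here is a tiny self-contained test program (the class name is just mine for illustration) that should reproduce the table above as well as the "HL"/"LH" cases:

    import java.util.regex.Pattern;

    public class SurrogateDotDemo {
        public static void main(String[] args) {
            Pattern one = Pattern.compile("^.$");
            Pattern two = Pattern.compile("^..$");

            String pair    = "\uD83D\uDC7E";   // U+1F47E as a proper H,L pair
            String flipped = "\uDC7E\uD83D";   // L,H -- no legal code point at all
            String lone    = "\uD83D";         // an unpaired high surrogate

            System.out.println(one.matcher(pair).matches());     // true  (correct)
            System.out.println(two.matcher(pair).matches());     // false (correct)
            System.out.println(one.matcher(lone).matches());     // TRUE! (the anomaly)
            System.out.println(two.matcher(flipped).matches());  // TRUE! (two "characters"?!)
        }
    }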
=======================================
 CANON_EQ Bugs in Regexes with \\uXXXX
=======================================

Another place where you are forced to think about the internal representation in Java regexes is that they can behave differently if you pass things in as "\\uXXXX" instead of as "\uXXXX". I don't think that can be correct behaviour, either. The problem is that CANON_EQ can no longer be trusted.

If you compile these patterns with CANON_EQ, it makes a difference whether you've used a literal or a \u0000 form. Please consider these, as I believe the FALSE! results below are all in error:

         String        Pattern w/CANON_EQ    Result
         =========     ==================    ======
    A :  "\u00E9"      "^\u00E9$"            true
    B :  "\u00E9"      "^e\u0301$"           true
    A':  "\u00E9"      "^\\u00E9$"           true
    B':  "\u00E9"      "^e\\u0301$"          FALSE!
    C :  "e\u0301"     "^\u00E9$"            true
    D :  "e\u0301"     "^e\u0301$"           true
    C':  "e\u0301"     "^\\u00E9$"           FALSE!
    D':  "e\u0301"     "^e\\u0301$"          true

The ABCD versions all use literals converted during the lexical substitution phase, whereas the primed versions pass the escapes through for the regex compiler itself to decode. (This second mechanism is indispensable to meet the requirement of being able to write any code point, and to facilitate reading patterns written in ASCII but specifying trans-ASCII code points.)

You get the same problem with octal notation: you can specify U+E9 as "\351" for the prepass literal (which works), or as "\\0351" for the regex engine to see (which fails under canonical equivalence just as \\u did):

         String        Pattern w/CANON_EQ    Result
         =========     ==================    ======
    a :  "\u00E9"      "^\351$"              true
    a':  "\u00E9"      "^\\0351$"            true
    c :  "e\u0301"     "^\351$"              true
    c':  "e\u0301"     "^\\0351$"            FALSE!

As you might predict, using UTF-8 directly in your source and compiling with "javac -encoding UTF-8" behaves exactly as the non-prime "\uXXXX" versions do, which can differ from how the primed "\\uXXXX" versions behave.

From looking at the code, I am sure I could reproduce this with \xXX escapes as well. That's because you do the normalization reshuffle before you actually compile the pattern, so you never see the octal or hex escapes while you're doing the normalization. The bug is right here, in this code from around line 1500 of jdk1.7.0/java/util/regex/Pattern.java:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

Because things like \cC and \0XXX and \xXX and \uXXXX all get handled *after* that point in the code, they are *not* the same as literals with those values. This is a genuine problem. So again we have to think about how things are stored. It means that you cannot just read in patterns that have had their non-ASCII converted into \uXXXX escapes and have them work the same as having the literals in there. Those are supposed to be the same as the literals, but they're not.

This is quite apart from the -- um, "syntactic infelicity"? -- of the mismatch between how octal escapes are specified in the lexical substitution pass versus how they're specified in the regex engine. That I wouldn't quite call a bug so much as an unexpected wrinkle. I do fix this in my regex rewriter, BTW.

(There are "syntactic infelicities" with \cC, too. It is a bit too undiscerning, producing things that aren't guaranteed to be control characters, because it blindly XORs whatever follows it with 64. For example, \c} is = and \c= is }, \cé is © and \c© is é, etc.)

This message is far too long again, so I will discuss your comments regarding the j.l.Character class in part 3 of 3, to be sent later on.

Thanks again!

--tom
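P.S. In case it helps with triage, here is an analogous self-contained demonstration (the class name is again just mine) reproducing the B and B' rows from the first table above:

    import java.util.regex.Pattern;

    public class CanonEqDemo {
        public static void main(String[] args) {
            String composed = "\u00E9";  // é as the single code point U+00E9

            // Case B: the combining mark is a literal, substituted during the
            // lexical prepass, so normalize() in compile() can compose it.
            System.out.println(
                Pattern.compile("^e\u0301$", Pattern.CANON_EQ)
                       .matcher(composed).matches());            // true

            // Case B': the \u escape survives into the pattern text, so
            // normalize() sees only backslash-u-0-3-0-1 and composes nothing;
            // the escape is decoded later, during parsing, after CANON_EQ
            // has already done its work.
            System.out.println(
                Pattern.compile("^e\\u0301$", Pattern.CANON_EQ)
                       .matcher(composed).matches());            // FALSE!
        }
    }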