When I set about resolving the Unicode troubles in Java regular expressions through rewriting them into something Java understood, I found it convenient to divide that functionality into two different rewriting functions, one to handle string escapes like \uXXXX and the other to handle charclass escapes like \w. I will discuss the first of these two functions here in part 2, and the second of them in part 3.
Even though I consider it only an alpha prototype, this is fully working code that fulfills all the requirements I set. It is being used in a production environment, although this is an internal use, not one where it has been released to outside bodies.

I am absolutely *not* advocating that this code be taken up as is by the JDK. Even if you do care to use some aspects of it--which you are perfectly welcome to do, BTW; we're 100% open source--I entertain no notion of it remaining recognizable. I discuss it here because it does manage to resolve almost all of the Level 1 compliance issues I have raised.

Again, my rewrite code has two different functions: one for character escapes like \uXXXX, and one for charclass escapes like \w.

CHARACTER ESCAPES
=================

The one for character escapes translates *symbolically* specified character escapes found in the pattern into corresponding code points. It works on the following character escapes, of which only the last one, \x{...}, is entirely new to Java.

    -- \a \e \f \n \r \t   [but *not* \b due to next function]
    -- \cX
    -- \0 \0N \0NN \N \NN \NNN   (where N is any octal digit)
    -- \xXX   (X=2)
    -- \uXXXX   (X=4)
    -- \x{XXXXXX}   (X = 1-8)

This function serves at least four different purposes, which I shall elaborate on below:

    1. Lets you unescape strings read in with embedded char escapes.
    2. Makes strings and patterns accept the same escape syntax.
    3. Resolves the RL1.1 niggle on hex notation.
    4. Resolves the CANON_EQ bug interfering with satisfying RL2.1 Canonical Equivalents, which the JDK7 pattern docs claim is met.

There is to my knowledge no function in the core Java library that takes as input a string with character escapes and produces as output a new string with all the character escapes replaced by literals. That's what this one does.

(Purpose 1) You need to do this so that you can read in strings from elsewhere than program literals and have them count the same as though they were program literals.
For example, command line arguments, configuration files, environment variables, user input, etc.

(Purpose 2) Right now there are several subtle mismatches between which character escapes work in Java's general string literals and which ones work only within strings that eventually make their way to Pattern.compile(). This function handles both sorts so you don't have to remember which is which. If need be, I can discuss what these mismatches are with precision.

(Purpose 3) The new \x{...} borrowed from Perl lets you specify logical code points, not physical 16-bit code units. That way you can look directly at the code and know immediately what code point is meant without having to run a pair of them through a function to combine high and low surrogates. This thereby satisfies RL1.1 even to my satisfaction.

(Purpose 4) There is a bug in the Pattern.compile() code in its handling of the CANON_EQ flag. It normalizes the input string before it parses it. That means that character literals are correctly normalized but character escapes are not. This function can be used as a workaround: if you first pass the string through this one before you send it to compile, the character escapes will already have been turned into the needed literals. This allows the Java Pattern class to meet RL2.1.

That BTW is why this function does not translate \b into backspace as one would normally expect if it were only unescaping strings: it has to be left intact so that it can still be a word boundary when the string is used as a regex. Ideally the API would be designed so that both are possible.

In the final part 3 of this letter, I will discuss the function I wrote that rewrites a Java regex's character-class escapes to make them work right on Unicode strings.

--tom
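
To make the character-escape pass described above concrete, here is a minimal Java sketch of the idea. It is not the rewriting function this letter discusses: the class name EscapeSketch and the method unescape are hypothetical, and it covers only \a \e \f \n \r \t, \xXX, \uXXXX and \x{...}, leaving out \cX and the octal forms. Any other backslash pair, including \b, \w and an escaped backslash, is copied through unchanged so that a later regex compile still sees it.

```java
public final class EscapeSketch {
    // Code points for the single-letter escapes a e f n r t, in that order.
    private static final char[] SIMPLE = {0x07, 0x1B, '\f', '\n', '\r', '\t'};

    private EscapeSketch() {}

    /** Replaces a subset of character escapes with the characters they name. */
    public static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '\\' || i + 1 == in.length()) {
                out.append(c);                 // ordinary char or lone trailing backslash
                i++;
                continue;
            }
            char e = in.charAt(i + 1);
            int simple = "aefnrt".indexOf(e);
            if (simple >= 0) {
                out.append(SIMPLE[simple]);
                i += 2;
            } else if (e == 'x' && i + 2 < in.length() && in.charAt(i + 2) == '{') {
                // Braced hex form: 1 to 8 hex digits naming a whole code point.
                int close = in.indexOf('}', i + 3);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated brace at index " + i);
                }
                out.appendCodePoint(hex(in, i + 3, close, 1, 8));
                i = close + 1;
            } else if (e == 'x') {
                out.appendCodePoint(hex(in, i + 2, i + 4, 2, 2));   // exactly two digits
                i += 4;
            } else if (e == 'u') {
                out.append((char) hex(in, i + 2, i + 6, 4, 4));     // one 16-bit code unit
                i += 6;
            } else {
                out.append(c).append(e);       // leave everything else for the regex compiler
                i += 2;
            }
        }
        return out.toString();
    }

    // Parses s[from, to) as ASCII hex digits, enforcing a digit-count range.
    private static int hex(String s, int from, int to, int min, int max) {
        int len = to - from;
        if (to > s.length() || len < min || len > max) {
            throw new IllegalArgumentException("bad hex escape at index " + from);
        }
        long value = 0;
        for (int k = from; k < to; k++) {
            char ch = s.charAt(k);
            int d = ch < 128 ? Character.digit(ch, 16) : -1;
            if (d < 0) {
                throw new IllegalArgumentException("not a hex digit at index " + k);
            }
            value = value * 16 + d;
        }
        if (value > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("code point out of range at index " + from);
        }
        return (int) value;
    }
}
```

Under these assumptions, a string read from a config file or the command line would go through EscapeSketch.unescape(raw) first, and the result would then be handed to Pattern.compile(), optionally with Pattern.CANON_EQ. One thing this sketch does not address: an escape that names a regex metacharacter (U+002A, say) comes out as a bare literal, which Pattern.compile() would then read as a metacharacter.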